PREFACE


The CYBER 200 Applications Seminar, held on October 10-12, 1983, in Lanham, 
Maryland, under the sponsorship of NASA/Goddard Space Flight Center and 
Control Data Corporation, is the second of its kind. These proceedings comprise 
the majority of the papers presented at the meeting. Papers for the seminar 
were selected on the basis of showing a broad distribution of applications for 
which the CYBER 200 may be well suited. These ranged from problems in 
meteorology to problems in economics. A breakdown of the disciplines 
represented is shown below. Some of the papers actually could fall in more than 
one category, but only one is indicated for each. 


Discipline                        Papers

Meteorology/Oceanography            5
Chemistry                           4
Math Algorithms for 205             3
Fluid Dynamics                      3
Monte Carlo Methods                 3
Petroleum                           2
Electronic Circuit Simulation       1
Biochemistry                        1
Lattice Gauge Theory                1
Economics                           1
Ray Tracing                         1


In the first seminar, held in August 1982, it was evident that much work was yet to
be done in learning to use a vector machine. At that time, only a few of the
CYBER 205s had been installed. One year later, we see numerous examples of
good vectorizing work carried out by still relatively inexperienced vector
computer users. Clearly, in time we shall see a great deal more optimization, and
effective performance will become routine.
Aileen Foreman


Mathematical Algorithms to Maximize Performance in Numerical Weather Prediction 
Introduction 


Numerical weather prediction models, which involve the solution of non-linear 
partial differential equations at points on an extensive three-dimensional grid, 
are ideally suited for processing on vector machines. It was logical therefore 
that the new global forecast model to be implemented at the Meteorological Office 
should be written in vector code for the Cyber 205. 


In order to achieve full efficiency and to reduce storage requirements, the
model used 32-bit arithmetic, which had been found to provide high enough precision.
Unfortunately, however, the trigonometrical and logarithmic functions provided 
by CDC could only handle 64-bit vectors and, although written in efficient scalar 
code, did not take advantage of the special facilities of a vector processor. It 
was therefore necessary to rewrite the functions in vector code to handle both 
32 and 64-bit vectors. There was also no half-precision compiler available for 
the Cyber 205 at that time and so the functions, like the model, had to make 
extensive use of the "special call" syntax. This made the code more difficult to 
write but it allowed much greater flexibility in that it became possible to access 
the exponent of a floating-point number independently of its coefficient. 


This paper presents a description of the techniques and it summarises the 
results which were achieved. One example, the logarithmic function, is treated 
here in detail to illustrate the general approach to the problem. 


Derivation of logarithms 


The coding for the logarithm function illustrates both the use of the way in 
which floating-point numbers are stored and the use of linked triads to gain 
additional speed. 


To calculate $\log_e(x)$, we divide the range of $x$ into two, treated in cases
a) and b) below. [The limits of the two ranges are illegible in the source.]

a) We first write the value of $x$ in a way which can be related to the format
of stored floating-point numbers. Thus, introducing two new unknowns $n$ and $w$,
$n$ being an integer and $1 \le w < 2$, we may write any number as

    $x = 2^{n} w$

Now the Cyber 205 stores the floating-point number as

    $2^{exp} \cdot \mathrm{coefficient} = 2^{exp} \cdot 2^{j} \cdot k$

where the factor $2^{j}$ is introduced by normalisation.

Since, for logarithms, $x$ must always be positive, for 64-bit numbers bit 17
will be on, so $j = 46$, and for 32-bit numbers bit 9 will be on, so $j = 23$.

Then, relating the two, we have $n = exp + j$ and $w = k$.
As an example, if $x = 2.0$ is held as a 64-bit normalized value, then from the
above formulae

    $n = -45 + 46 = 1$  and  $w = 1.0$

Here, we can obtain the values of $n$ and $w$ very easily, as we can access
the exponent and coefficient of a number by using special calls.
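The split of a number into an exponent and a coefficient can be sketched in Python. This is only an illustration on IEEE doubles, not the Cyber 205 format or its special calls: `math.frexp` stands in for the exponent/coefficient access, the function name `split_exponent_coefficient` is hypothetical, and the normalisation $1 \le w < 2$ is an assumption chosen to agree with the worked example of $x = 2.0$.

```python
import math

def split_exponent_coefficient(x):
    """Return (n, w) with x == 2**n * w, n an integer and 1 <= w < 2.

    Hypothetical helper: math.frexp plays the role of the 'special call'
    access to the exponent and coefficient of a stored number.
    """
    if x <= 0.0:
        raise ValueError("logarithm argument must be positive")
    frac, exp = math.frexp(x)    # x == frac * 2**exp with 0.5 <= frac < 1
    return exp - 1, frac * 2.0   # rescale so that 1 <= w < 2

n, w = split_exponent_coefficient(2.0)
print(n, w)
```

Once $n$ and $w$ are known, $\log_e(x) = n \log_e 2 + \log_e(w)$, so only the logarithm of a value in a narrow range remains to be approximated.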

The next step is to convert the functions into a suitable form for vectorization,
and this involves the introduction of a new variable $z$, defined in terms of $w$
[expression illegible in source], which can be computed at the same time as $w$.
From the original definition of $x$, $\log_e(x)$ can then be written in terms of
$n$ and $\log_e\left(\frac{1+z}{1-z}\right)$ [expression illegible in source].

b) For the remaining values of $x$ [range illegible in source], a second
substitution [illegible in source] again reduces $\log_e(x)$ to an expression in
$\log_e\left(\frac{1+z}{1-z}\right)$.

In each case, the problem then becomes one of vectorizing
$\log_e\left(\frac{1+z}{1-z}\right)$, which is easily done by replacing it with a
truncated series which gives the required degree of precision:

    $\log_e\left(\frac{1+z}{1-z}\right) \approx \sum_{m=0}^{6} c_m z^{2m+1}$

where the constants $c_m$ are known. Then the series is evaluated as

    $((((((c_6 z^2 + c_5) z^2 + c_4) z^2 + c_3) z^2 + c_2) z^2 + c_1) z^2 + c_0)\, z$

Despite its complicated appearance, this reduces to eight vector operations
consisting of a multiplication, six linked triads and a final multiplication
by $z$, thus:
Multiplication to give $z^{2}$

First triad  = V1 = $c_6 z^{2} + c_5$

Second triad = V2 = $V1 \cdot z^{2} + c_4$

Third triad  = V3 = $V2 \cdot z^{2} + c_3$

etc.
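The sequence of one multiplication, six linked triads and a final multiplication can be sketched as follows. This is a minimal illustration under stated assumptions: the series is taken to be the textbook expansion of $\log_e((1+z)/(1-z))$ with coefficients $2/(2m+1)$ (the constants actually used in the routine are not given), and the names `COEFFS`, `triad` and `log_ratio_series` are hypothetical. Python lists stand in for vectors.

```python
import math

# Textbook coefficients of log((1+z)/(1-z)) = 2*(z + z**3/3 + z**5/5 + ...).
# Assumption: the routine's own (possibly tuned) constants are not known.
COEFFS = [2.0 / (2 * m + 1) for m in range(7)]   # c_0 .. c_6

def triad(v, zz, c):
    """One linked triad: elementwise v * zz + c with scalar c."""
    return [vi * zi + c for vi, zi in zip(v, zz)]

def log_ratio_series(z):
    """Evaluate the truncated series for every element of the vector z."""
    zz = [zi * zi for zi in z]                   # multiplication giving z**2
    acc = [COEFFS[6]] * len(z)                   # start from c_6
    for c in reversed(COEFFS[:6]):               # six triads: c_5 down to c_0
        acc = triad(acc, zz, c)
    return [a * zi for a, zi in zip(acc, z)]     # final multiplication by z

z = [-0.17, 0.0, 0.05, 0.17]
approx = log_ratio_series(z)
exact = [math.log((1 + zi) / (1 - zi)) for zi in z]
```

Each pass of the loop corresponds to one vector instruction, so the whole polynomial costs eight vector operations regardless of vector length.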

Tests, using the 1.5 compiler, and a range of vector lengths gave the
following results, with times being expressed in units of $10^{-4}$ seconds.

Vector length                          10     50    100    200    500   1000   2000   5000

CDC logarithm                          .3    .55     .7   1.01   2.00   3.66   7.04  21.50
64-bit vector logarithm               .47    .61    .78   1.12   2.16   3.87   7.47  20.15
32-bit vector logarithm               .53    .57    .65    .82   1.34   2.20   3.99   9.66


The first point to notice here is that the full increase in speed for
32-bit vectors is only achieved with large vector lengths. Because of the
overheads associated with the initiation of vector instructions, this is not
unexpected and is common to all of the functions to be described. What is
unexpected is that no improvement in speed was achieved for our 64-bit function
when compared to the CDC function. In this respect, this function is unique
among all those treated in this paper. However, the original aim of producing
a 32-bit version has been successfully achieved.
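The dependence of the 32-bit advantage on vector length can be read directly from the logarithm timings. The short sketch below only re-uses the figures quoted in the table; the variable names are illustrative.

```python
# Logarithm timings as quoted in the table, one entry per vector length.
lengths = [10, 50, 100, 200, 500, 1000, 2000, 5000]
cdc_64 = [0.30, 0.55, 0.70, 1.01, 2.00, 3.66, 7.04, 21.50]
vec_64 = [0.47, 0.61, 0.78, 1.12, 2.16, 3.87, 7.47, 20.15]
vec_32 = [0.53, 0.57, 0.65, 0.82, 1.34, 2.20, 3.99, 9.66]

# Ratio of 64-bit time to 32-bit time: below 1 means 32-bit is slower.
ratio_64_over_32 = [t64 / t32 for t64, t32 in zip(vec_64, vec_32)]

for n, r in zip(lengths, ratio_64_over_32):
    print(f"{n:5d}  {r:4.2f}")
```

At a length of 10 the ratio is below one, so start-up overhead dominates; only at a length of 5000 does the ratio slightly exceed two.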


Exponentials 


The exponential function is derived from a standard formula, chosen to make use
of special calls, in which $e^{x}$ is written as a product of powers of 2
determined by an integer $k$, an integer $m$ and a fraction $f$. [The defining
expressions for $k$, $m$ and $f$ are illegible in the source; $m$ is taken
modulo 16, with separate forms for $x \ge 0$ and $x < 0$.]

Now, since $m$ is an integer and $0 \le m < 16$, the factor $2^{m/16}$ is
obtained from a look-up table of 16 elements of known values, using the "special
call" instruction Q8VXTOV.

Having found the integer $k$ from the above formula, and $2^{m/16}$ from the
look-up table, to obtain the value $2^{k} \cdot 2^{m/16}$ we add $k$ to the
exponent part of $2^{m/16}$ by using special calls.

The remaining factor is given by an approximation in $f$ [expression illegible
in source], where $f$ is obtained as above and $p_0$, $p_1$, $p_2$ are known
constants.

Then, to obtain $e^{x}$ all we need is a final multiply of this factor by
$2^{k} \cdot 2^{m/16}$.
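The table-lookup idea can be sketched in Python. This is an illustration of the general technique rather than the routine itself: the exact definitions of $k$, $m$ and $f$ and the constants of the final approximation are not recoverable from the text, so the decomposition $e^{x} = 2^{k} \cdot 2^{m/16} \cdot 2^{f}$ with $0 \le f < 1/16$, the short Taylor polynomial for $2^{f}$, and the names `TABLE` and `exp_by_table` are all assumptions. `math.ldexp` stands in for adding $k$ to the exponent field.

```python
import math

LN2 = math.log(2.0)
TABLE = [2.0 ** (m / 16.0) for m in range(16)]   # 16-entry look-up table

def exp_by_table(x):
    """Sketch of e**x = 2**k * 2**(m/16) * 2**f (assumed decomposition)."""
    t = x / LN2                    # e**x == 2**t
    n = math.floor(16.0 * t)       # whole number of sixteenths in t
    k, m = divmod(n, 16)           # n == 16*k + m with 0 <= m < 16
    f = t - n / 16.0               # remainder, 0 <= f < 1/16
    u = f * LN2                    # 2**f == e**u with u < 0.044
    # Short Taylor polynomial for e**u (assumption, not the paper's formula).
    two_f = 1.0 + u * (1.0 + u * (0.5 + u * (1.0 / 6 + u * (1.0 / 24 + u / 120))))
    # ldexp adds k to the binary exponent, like the special-call step.
    return math.ldexp(TABLE[m] * two_f, k)

print(exp_by_table(1.0), math.exp(1.0))
```

Python's `divmod` already returns $0 \le m < 16$ for negative arguments, so the separate handling of $x < 0$ described in the text is not needed in this sketch.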

The following results were achieved, times again being given in units of
$10^{-4}$ seconds.

Vector length                          10     50    100    200    500   1000   2000   5000

CDC exponential                       .35     .7    .93   1.44   2.86   5.25  10.52  33.36
64-bit vector exponential             .47     .6    .78   1.14   2.29   4.15   7.97  22.75
32-bit vector exponential             .47    .56    .68    .93   1.85   3.14   5.85  14.62

Here, for a vector length of 5000 the 32-bit exponential routine is only
40% faster than the 64-bit routine because of the use of the "special call"
Q8VXTOV. However, the 64-bit routine has achieved a considerable speed-up over
the CDC exponential.

The Hyperbolic functions 

The routines to calculate the hyperbolic functions use the following formula,

    $\cosh x = \frac{e^{x} + e^{-x}}{2}$

The calculation of $e^{x}$ is as described earlier. During the calculation of
$e^{x}$, little extra work is required to obtain $e^{-x}$, which avoids the need
to call the exponential routine twice.

The hyperbolic sine is given by

    $\sinh x = \frac{e^{x} - e^{-x}}{2}$                  for $|x| \ge 0.5$

and

    $\sinh x = \sum_{m=0}^{n} c_m x^{2m+1}$              for $|x| < 0.5$

Here the two distinct cases are treated independently, so that we are dealing
with shorter vector lengths, and then the results are merged together at the end
of the routine. The polynomial expansion of sinh x can be performed in seven
vector instructions, by using linked triads.
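The split-and-merge treatment of the two cases can be sketched as below. It is a minimal illustration, assuming plain Taylor coefficients $1/(2m+1)!$ and seven terms for the small-argument polynomial (the routine's constants and term count are not stated); the names `sinh_series` and `sinh_vector` are hypothetical.

```python
import math

def sinh_series(x, terms=7):
    """Odd Taylor polynomial for sinh, evaluated by Horner's rule in x*x.

    Assumption: Taylor coefficients and 7 terms; the paper's constants
    are not given.
    """
    xx = x * x
    acc = 0.0
    for m in reversed(range(terms)):
        acc = acc * xx + 1.0 / math.factorial(2 * m + 1)
    return acc * x

def sinh_vector(xs):
    """Treat |x| >= 0.5 and |x| < 0.5 separately, then merge the results."""
    out = [0.0] * len(xs)
    large = [i for i, x in enumerate(xs) if abs(x) >= 0.5]
    small = [i for i, x in enumerate(xs) if abs(x) < 0.5]
    for i in large:                      # exponential form
        e = math.exp(xs[i])
        out[i] = 0.5 * (e - 1.0 / e)     # e**-x obtained as 1 / e**x
    for i in small:                      # polynomial form
        out[i] = sinh_series(xs[i])
    return out

values = sinh_vector([-3.0, -0.2, 0.0, 0.49, 0.5, 4.0])
```

Gathering the indices of each case first mirrors the shorter vectors mentioned in the text; the final list is the merged result.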

The hyperbolic tangent is given by

    $\tanh x = \sum_{m=0}^{n} c_m x^{2m+1}$        for  $0 < |x| \le 0.12$

    $\tanh x = 1 - \frac{2}{e^{2x} + 1}$            for  $0.12 < |x| \le 18.0$

    $\tanh x = 1.0$                                 for  $x > 18.0$

    $\tanh x = -1.0$                                for  $x < -18.0$

Again, the distinct cases are treated independently so that we are dealing
with shorter vector lengths, and again we can use linked triads when calculating
the polynomial expansion of $\tanh x$.
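The four cases for the hyperbolic tangent can be sketched for a single value as follows. The thresholds 0.12 and 18.0 are those quoted in the text; the five-term Taylor polynomial and the form $1 - 2/(e^{2x}+1)$ for the middle range are assumptions, since the original expressions are only partly legible, and `tanh_piecewise` is a hypothetical name.

```python
import math

def tanh_piecewise(x):
    """Piecewise tanh: saturation, small-argument polynomial, exponential form."""
    if x > 18.0:
        return 1.0
    if x < -18.0:
        return -1.0
    if abs(x) <= 0.12:
        xx = x * x
        # Taylor series x - x^3/3 + 2x^5/15 - 17x^7/315 + 62x^9/2835
        # (assumed coefficients, evaluated in nested form).
        return x * (1.0 + xx * (-1.0 / 3 + xx * (2.0 / 15
                    + xx * (-17.0 / 315 + xx * (62.0 / 2835)))))
    return 1.0 - 2.0 / (math.exp(2.0 * x) + 1.0)

samples = [-20.0, -18.0, -1.0, -0.1, 0.0, 0.12, 0.5, 18.0, 25.0]
results = [tanh_piecewise(x) for x in samples]
```

Beyond $|x| = 18$ the true value differs from $\pm 1$ by less than double-precision rounding, which is why the constant can be returned directly.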

The timings of the hyperbolic sine and hyperbolic tangent routines are data
dependent, but some sample timings are given below. All times are expressed in
units of $10^{-4}$ seconds.

Vector length                          10     50    100    200    500   1000   2000   5000

Hyperbolic cosine, 64-bit vector      .55    .79   1.08   1.68   3.45   6.41  13.26  37.65
Hyperbolic cosine, 32-bit vector      .54    .69      ?   1.27   2.44   4.44   8.72  22.99
Hyperbolic sine, 64-bit vector        .75    .99   1.30   1.96   3.88   7.27  14.87  43.85
Hyperbolic sine, 32-bit vector        .72    .87   1.07   1.48   2.74   5.00   9.47  24.38
Hyperbolic tangent, 64-bit vector     .66    .87   1.15   1.68   3.33   6.01  11.79  34.83
Hyperbolic tangent, 32-bit vector     .64    .73    .89   1.21   2.30   3.66   6.87  17.76

(? = value illegible in the source.)

Again, we see that for very short vector lengths we gain little advantage by
using 32-bit vectors, but for longer vector lengths we are approaching twice the
speed of the 64-bit functions. There were no CDC functions available to compare
with our results.

Sines and cosines 


The trigonometrical functions, $\sin x$ and $\cos x$, are calculated from
the polynomial expansion of $\sin x$ so that we can make use of linked triads
again. First the input argument needs to be reduced modulo $2\pi$. This is achieved
by letting

    $r_1 = \frac{2|x|}{\pi}$  and  $r_2 = \mathrm{int}\left[\frac{2|x|}{\pi}\right]$

then put

    $a = r_1 - r_2$  so that  $0 \le a < 1$

and $k = r_2$ modulo 4.

So $\sin x$ is given by

    $\sin x = \sin a$            for  $k = 0$

    $\sin x = \sin(1 - a)$       for  $k = 1$

    $\sin x = -\sin a$           for  $k = 2$

    $\sin x = -\sin(1 - a)$      for  $k = 3$
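The quadrant reduction can be sketched as below. Two points are assumptions rather than statements from the text: the reduced argument $a$ is interpreted in units of a quarter period, so that "$\sin a$" means $\sin(\pi a/2)$, and negative inputs are handled by the odd symmetry of the sine. The library sine stands in for the polynomial, and the names `quarter_sine` and `sine_by_quadrant` are hypothetical.

```python
import math

def quarter_sine(a):
    """sin(pi*a/2) for 0 <= a <= 1; stand-in for the paper's polynomial."""
    return math.sin(0.5 * math.pi * a)

def sine_by_quadrant(x):
    """Reduce |x| to a quadrant index k and a fraction a, then pick the case."""
    r1 = 2.0 * abs(x) / math.pi
    r2 = math.floor(r1)            # whole number of quarter periods
    a = r1 - r2                    # 0 <= a < 1
    k = r2 % 4
    if k == 0:
        s = quarter_sine(a)
    elif k == 1:
        s = quarter_sine(1.0 - a)
    elif k == 2:
        s = -quarter_sine(a)
    else:
        s = -quarter_sine(1.0 - a)
    return -s if x < 0 else s      # assumed sign handling for negative x

angles = [-7.3, -0.4, 0.0, 1.0, 2.5, 4.0, 6.0, 100.0]
reduced = [sine_by_quadrant(x) for x in angles]
```

Because every case calls the same kernel on an argument in $[0, 1]$, a single polynomial covers all quadrants.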
where

    $\sin z = \sum_{m=0}^{8} c_m z^{2m+1}$        for the 64-bit function

and the constants $c_m$ are known.

Because the values $c_7$ and $c_8$ are too small to affect the accuracy of the
32-bit function results:

    $\sin z = \sum_{m=0}^{6} c_m z^{2m+1}$        for the 32-bit vector function

The cosine function is given by

    $\cos x = \sin\left(\frac{\pi}{2} + x\right)$

where $\sin\left(\frac{\pi}{2} + x\right)$ is calculated as above.


If it is known that the input operand, $x$, is always between $-2\pi$ and $+2\pi$
radians, much work can be left out of the routine; for, as above, let

    $r_1 = \frac{2|x|}{\pi}$  and  $r_2 = \mathrm{int}\left[\frac{2|x|}{\pi}\right]$

and so $r_2 \le 3$ and $k = r_2$ modulo 4 $= r_2$, and again $a = r_1 - r_2$.

So

    for  $k = r_2 = 0$,   $\sin x = \sin(a)$

    for  $k = r_2 = 1$,   $\sin x = \sin(1 - a)$

    for  $k = r_2 = 2$,   $\sin x = -\sin(a)$

    for  $k = r_2 = 3$,   $\sin x = -\sin(1 - a)$

Thus we have two sets of functions, one set to calculate the sine and cosine
of any angle expressed in radians, and the other to calculate the sine and cosine
of angles between $-2\pi$ and $+2\pi$ radians.

The polynomial expansion of sin(z) can be calculated in ten vector instructions 
including eight linked triad instructions for the 64-bit function and in eight 
vector instructions using six linked triad instructions for the 32-bit functions. 

Tests gave the following results, with times expressed in units of
$10^{-4}$ seconds.

Vector length                                  10     50    100    200    500   1000   2000   5000

CDC sine                                      .15     .5    .64    .91   1.72   3.07   6.13  22.98
64-bit vector sine (all angles)               .49    .59    .72    .98   1.74   3.02   5.59  14.98
32-bit vector sine (all angles)               .42    .46    .52    .63    .98   1.57   2.76   6.35
64-bit vector sine ($-2\pi$ to $+2\pi$)       .37    .44    .53    .72   1.27   2.20   4.07  10.04
32-bit vector sine ($-2\pi$ to $+2\pi$)       .34    .37    .41    .50    .75   1.20   2.09   4.78


Vector length                                  10     50    100    200    500   1000   2000   5000

CDC cosine                                     .3    .55    .68    .99   2.08   3.29   6.68  23.59
64-bit vector cosine (all angles)             .57    .60    .73    .99   1.87   3.19   5.94  16.00
32-bit vector cosine (all angles)             .69    .47    .51    .63    1.0   1.70   2.94   6.95
64-bit vector cosine ($-2\pi$ to $+2\pi$)     .72    .45    .55    .74   1.42   2.40   4.45  11.14
32-bit vector cosine ($-2\pi$ to $+2\pi$)       ?    .37    .41    .50    .77   1.37   2.31   5.51

(? = value illegible in the source.)


Thus, we can see that we need a vector length of 500 to 1000 before our
64-bit routines for all angles are faster than the CDC supplied routines, but
that our 32-bit routines for restricted angles between $-2\pi$ and $+2\pi$ are
over four times as fast as the CDC routines for vector lengths of 5000.

Tangents 

Similarly for the trigonometrical function, $\tan x$, we have supplied
two sets of functions, one set to calculate the tangent of any angle expressed
in radians in both 64-bits and 32-bits, and the other to calculate the tangent
of angles between $-2\pi$ and $+2\pi$ radians in both 64-bits and 32-bits. The
tangent function is calculated using a polynomial expansion of tan(x) to make
use of linked triads. The calculation is performed by first reducing the argument
modulo $\pi$.

Let

    $r_1 = \frac{4|x|}{\pi}$  and  $r_2 = \mathrm{int}\left[\frac{4|x|}{\pi}\right]$

then $a = r_1 - r_2$ so that $0 \le a < 1$.

Now let $k = r_2$ modulo 8, leaving $k$ unchanged if $k < 4$ and putting
$k = k - 4$ if $k \ge 4$.

$\tan(x)$ is now given by

    $\tan(x) = \tan(a)$                  for  $k = 0$

    $\tan(x) = \frac{-1}{\tan(a - 1)}$   for  $k = 1$

    $\tan(x) = \frac{-1}{\tan(a)}$       for  $k = 2$

    $\tan(x) = \tan(a - 1)$              for  $k = 3$

where $\tan(z) = \sum_{m=0}^{12} c_m z^{2m+1}$ to the required degree of precision.

Again, if it is known that the input operand is always between $-2\pi$ and
$+2\pi$ radians, we can write

    $r_1 = \frac{4|x|}{\pi}$  and  $r_2 = \mathrm{int}\left[\frac{4|x|}{\pi}\right]$

and so $0 \le r_2 < 8$.

In this case $k = r_2$ modulo 8 $= r_2$.

Then $k$ is left unchanged where $k < 4$ and replaced by $k - 4$ where $k \ge 4$,
and the calculation continues as before.

The polynomial expansion of tan(z) is calculated in fourteen vector 
instructions using twelve linked triads. 

The resulting timings of tests are given below, expressed in units of $10^{-4}$
seconds.

Vector length                                    10     50    100    200    500   1000   2000   5000

CDC tangent                                     .98    .73    .91   1.47   2.61   4.71   9.33  30.80
64-bit vector tangent (all angles)              .90    .82    .99   1.35   2.55   4.48   8.40  22.67
32-bit vector tangent (all angles)              .96    .78    .90   1.14   1.92   3.21   5.59  13.29
64-bit vector tangent ($-2\pi$ to $+2\pi$)      .67      ?    .93   1.25   2.36   4.14   7.74  20.64
32-bit vector tangent ($-2\pi$ to $+2\pi$)      .67    .70    .80    .99   1.76   2.94   5.15  11.98

(? = value illegible in the source.)
These results show that we need a vector length of only about 200 before
our 64-bit tangent function for all angles is faster than the CDC routine, and
that our 32-bit tangent function for restricted angles between $-2\pi$ and
$+2\pi$ radians is well over twice as fast as the CDC routine.

The Arctangent function 

The arctangent function atan(x) is again calculated from a polynomial
expansion so that we can use linked triads. The calculation is performed as
follows:

For $|x| \ge \sqrt{2} + 1$, let $w = \frac{1}{|x|}$,

and for $|x| < \sqrt{2} + 1$, let $w = |x|$.

Change the variable to $z$, defined by

    $z = \frac{w - a}{a w + 1}$

where $a$ is chosen to satisfy a condition on $z$ [condition illegible in source].

Under this condition, $a = (1 - \sqrt{2}) + \sqrt{4 - 2\sqrt{2}}$, and is therefore a
constant.

Then atan(w) is given by

    atan(w) = atan(z) + atan(a)

Here, atan(a) is a constant and need only be calculated once, and we may replace
atan(z) by the truncated series:

    $\mathrm{atan}(z) = \sum_{m=0}^{8} c_m z^{2m+1}$

For $|x| \ge \sqrt{2} + 1$, $\mathrm{atan}(|x|) = \frac{\pi}{2} - \mathrm{atan}(w)$,

and for $x < 0$, $\mathrm{atan}(x) = -\mathrm{atan}(|x|)$.

Atan(z) can be calculated in ten vector instructions, eight of which are
linked triad instructions. The results are in the range $-\frac{\pi}{2}$ to
$+\frac{\pi}{2}$ (not inclusive).
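The arctangent reduction can be sketched as follows. The substitution $z = (w - a)/(aw + 1)$ and the identity $\mathrm{atan}(w) = \mathrm{atan}(z) + \mathrm{atan}(a)$ follow the text; the long Taylor series used for atan(z) is only a stand-in for the routine's short polynomial (whose constants are not given), and the names `atan_small` and `atan_reduced` are hypothetical. With the quoted constant, $a$ equals $\tan(3\pi/16)$, so $|z| \le a$ over the whole reduced range.

```python
import math

T = math.sqrt(2.0) + 1.0                                   # switch-over point
A = (1.0 - math.sqrt(2.0)) + math.sqrt(4.0 - 2.0 * math.sqrt(2.0))
ATAN_A = math.atan(A)                                      # computed once

def atan_small(z, terms=40):
    """Taylor series z - z^3/3 + z^5/5 - ... by Horner's rule.

    Stand-in only: the paper uses a much shorter tuned polynomial.
    """
    y = -(z * z)
    acc = 0.0
    for m in reversed(range(terms)):
        acc = acc * y + 1.0 / (2 * m + 1)
    return acc * z

def atan_reduced(x):
    ax = abs(x)
    w = 1.0 / ax if ax >= T else ax        # bring the argument below sqrt(2)+1
    z = (w - A) / (A * w + 1.0)            # now |z| <= A
    r = atan_small(z) + ATAN_A             # atan(w) = atan(z) + atan(a)
    if ax >= T:
        r = 0.5 * math.pi - r              # atan|x| = pi/2 - atan(1/|x|)
    return -r if x < 0 else r              # odd symmetry

points = [-50.0, -2.5, -0.3, 0.0, 0.5, 1.0, 2.4, 3.0, 1.0e6]
angles_out = [atan_reduced(x) for x in points]
```

Centring the reduced range on $a$ halves the largest $|z|$ the polynomial has to handle, which is what keeps the series short.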

The following results were achieved, times again being given in units of
$10^{-4}$ seconds.

Vector length                          10     50    100    200    500   1000   2000   5000

CDC arctangent                        .38    .90   1.19   1.97   4.06   7.35  16.19  46.09
64-bit vector arctangent              .48    .52    .66    .92   1.91   3.07   5.77  15.23
32-bit vector arctangent              .43    .49    .55    .69   1.10   1.79   3.34   7.27


These results are spectacular, in that the 32-bit arctangent function is 
over six times as fast as the CDC routine and even the 64-bit version has given 
a threefold increase in speed. 


Derivation of arcsine and arccosine functions


The final trigonometric routines to be considered calculate the arcsine and
arccosine of x. The calculations are performed as follows.

For $0 \le x \le \frac{\sqrt{2}}{2}$, let $z = x$, so that asin(x) = asin(z),

and for $\frac{\sqrt{2}}{2} < x \le 1$, let $z = (1 - x^{2})^{1/2}$ and
$\mathrm{asin}(x) = \frac{\pi}{2} - \mathrm{asin}(z)$.

For $-1 \le x < 0$, asin(x) = -asin(-x) and the same substitutions are used.

Now the new variable, z, must be between zero and 0.7 so we may write

    $\mathrm{asin}(z) = \sum_{m=0}^{11} c_m z^{2m+1}$

to the required degree of precision.

The arccosine function is derived from the arcsine using the substitution

    $\mathrm{acos}(x) = \frac{\pi}{2} - \mathrm{asin}(x)$

The polynomial expansion of asin(z) is calculated in thirteen vector
instructions, eleven of which are linked triads. The range of the results for
arcsine is $-\frac{\pi}{2}$ to $+\frac{\pi}{2}$ inclusive, and for arccosine is 0
to $\pi$ inclusive.
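The arcsine reduction can be sketched as below. The split point $\sqrt{2}/2$ and the substitution $z = (1 - x^{2})^{1/2}$ are assumptions inferred from the statement that $z$ lies between zero and 0.7; the long Taylor series is a stand-in for the routine's short polynomial, and the names `asin_series`, `asin_reduced` and `acos_reduced` are hypothetical.

```python
import math

SPLIT = math.sqrt(0.5)      # assumed switch-over point, about 0.707

def asin_series(z, terms=45):
    """Taylor series of asin by Horner's rule in z*z (stand-in polynomial)."""
    coeffs = [1.0]
    for m in range(1, terms):
        # ratio of successive Taylor coefficients of asin
        coeffs.append(coeffs[-1] * (2 * m - 1) ** 2 / (2 * m * (2 * m + 1)))
    zz = z * z
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * zz + c
    return acc * z

def asin_reduced(x):
    ax = abs(x)
    if ax > 1.0:
        raise ValueError("argument outside [-1, 1]")
    if ax <= SPLIT:
        r = asin_series(ax)                                   # z = x
    else:
        r = 0.5 * math.pi - asin_series(math.sqrt(1.0 - ax * ax))
    return -r if x < 0 else r                                 # odd symmetry

def acos_reduced(x):
    return 0.5 * math.pi - asin_reduced(x)

xs = [-1.0, -0.8, -0.3, 0.0, 0.5, 0.7071, 0.9, 1.0]
asin_out = [asin_reduced(x) for x in xs]
```

With this split the series is never evaluated beyond about 0.707, where it still converges quickly.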

The following results were achieved, with times expressed in units of $10^{-4}$
seconds.

Vector length                          10     50    100    200    500   1000   2000   5000

CDC arcsine                            .5    .67    .87   1.27    2.6   4.73   9.64  29.84
64-bit vector arcsine                 .52    .61    .75   1.04   2.02   3.55   6.69  16.54
32-bit vector arcsine                 .54    .51    .58    .73   1.37   2.25   3.91   9.11


Vector length                          10     50    100    200    500   1000   2000   5000

CDC arccosine                         .26    .68    .89   1.27   2.41   4.35   9.16  28.55
64-bit vector arccosine               .51    .61    .76   1.05   1.95   3.44   6.44  18.73
32-bit vector arccosine               .48    .54    .61    .76   1.25   2.07   3.66   8.59


Here our 32-bit functions are over three times as fast as the CDC routines, for
vector lengths of 5000.


Conclusion 


The trigonometrical and logarithmic functions, as provided by CDC up to and 
including version 2.0 of the compiler are, in general, not very efficient. At 
the Meteorological Office, we found it necessary to hand-code these functions in 
vector syntax to take full advantage of the facilities of the Cyber 205. For the 
32-bit versions, which have a high enough precision for most of our purposes,
speed increases of up to six times were obtained, and even for our 64-bit versions
increases of up to three times are possible. However, CDC have undertaken to
provide fully vectorized versions of the trigonometrical and logarithmic functions
in both 64-bits and 32-bits by release 2.1 of the compiler.

The functions described were written in the "special call" syntax because
of compiler limitations and the difficulties associated with this were partly 
offset by the special features which were then available. Users with the 2.0 
compiler could find that the extra facilities provided by the "special calls" 
do not overcome the difficulties involved with this syntax and that coding 
explicitly in the FORTRAN vector syntax achieves sufficient vectorization for 
their own purposes. 
ABSTRACT


The complete global specification of the state of the atmosphere on a daily or
more frequent basis is required for numerical weather forecasting. Although the
number of atmospheric variables required is small, namely temperature, winds,
moisture and surface pressure, globally and throughout the atmosphere, no single
space-borne instrument is able to meet these requirements at the desired degree
of accuracy and coverage. As a result, investigators have proposed to NASA a
number of composite systems with differing limitations in accuracy and coverage
under different atmospheric conditions.

Because of the extreme expense involved in developing and flight testing these
instruments, an extensive series of numerical modeling experiments to simulate
the performance of these meteorological observing systems has been performed on
the CYBER 205. The studies compare the relative importance of different global
measurements of individual and composite systems of the meteorological variables
needed to determine the state of the atmosphere. The assessments are made in
terms of the systems' ability to improve 12-hour global forecasts. Each
experiment involves the daily assimilation of simulated data obtained from a
data set we call "nature." This data is obtained from two sources: first, a long
two-month general circulation integration with the GLAS 4th Order Forecast Model
and second, global analysis prepared twice daily by the National Meteorological
Center, NOAA, from the current observing systems. More than two dozen
experiments representing different possible configurations were carried out and
analyzed. The experiments extend over a typical winter month, February, and
successive 12-hour forecasts are made from the analysis twice daily. Thus,
statistics are compiled from a total of 56 forecasts for each experiment.

This large number of experiments would have taken over a year on a dedicated
24-hour-per-day allocation on an Amdahl V-6. The study was completed in less
than a month on an as-available basis on the Cyber 205 at the NASA High Speed
Computing Facility.
Operational Numerical Weather Prediction on the Cyber 205 at
The Development Division of the National Meteorological Center (NMC)
has the responsibility of maintaining and developing the numerical
weather forecasting systems of the center. Because of the mission of
NMC, these products must be produced reliably and on time twice daily,
free of surprises for forecasters. Personnel of the Development Division
are in a rather unique situation. We must develop new advanced techniques
for numerical analysis and prediction utilizing current state-of-the-art
techniques, and implement them in an operational fashion without
damaging the operations of the center.

In the past, modifications have been made to the operational job 
suite without adequate testing and evaluation because computational 
resources were not available to produce enough case studies for evaluation. 
Hopefully, with the computational speeds and resources now available from 
the Cyber 205, Development Division Personnel will be able to introduce 
advanced analysis and prediction techniques into the operational job 
suite without disrupting the daily schedule. 
The operational job suite prior to the installation of the Cyber 205 contained
four major components:

1. A barotropic numerical model extending over the Northern Hemisphere, giving
forecasters an early look at the new synoptic situation immediately after data
collection at the start of the twice daily operational cycle.

2. A Limited Fine Mesh (LFM) primitive equation numerical model extending over
the North American continent. The LFM is started about 1 hour 45 minutes after
data collection, producing numerical guidance for use by forecasters when they
make their 12 to 48 hour forecasts.

3. A global primitive equation numerical model using a spectral representation
to produce numerical guidance for use by forecasters in the 2 to 5 day range.
This model is started at about 4 hours after each twice daily collection of
atmospheric data.

4. A global data assimilation cycle, started about 10 hours after data
collection and used to produce the first guess fields for the next synoptic
cycle. The data assimilation cycle consists of an optimum interpolation analysis
and a global spectral model, which are used to produce two six-hour
analysis/forecast cycles.

In addition to these four major components, a Moveable Fine Mesh model
is available when needed to produce forecasts of hurricane movement.
The hurricane model has the capability to move with the hurricane as it
forecasts the storm track for periods of 48 hours.





The operational implementation of these analysis/forecast systems on
the Cyber 205 will have to proceed in a careful, controlled manner so that
daily production schedules are maintained. For this reason, each component
of the operational suite must be carefully evaluated and tested after
conversion to the Cyber 205. All components of the present system scheduled
for implementation on the Cyber 205 will be converted in their present
form with the current resolution and numerics in order to evaluate their
performance in a parallel fashion. After about a month of successful
parallel tests, the component will become operational on the Cyber 205.

The National Weather Service received their Cyber 205 in May of 1983
and the first operational product appeared on August 30, 1983. The LFM
was successfully implemented on the Cyber 205 and has been producing
numerical guidance twice a day since that time. The final version of
the LFM computer program that was implemented takes about 75 seconds of
CPU time to produce a 48 hour forecast. This is about 15 times faster
than the IBM/195 version of the same model. The LFM is a grid-point
model containing 7 layers with 53 x 45 grid points in each layer. Five
prognostic variables (pressure, temperature, moisture, and two components
of wind speed) are specified at each of the 16,695 grid points. The
primitive equations are solved in finite difference form for each of the
prognostic variables and then advanced forward in time with an explicit
time step. Nine 400 second time steps are required for each hour of model
integration, which yields a total of 432 explicit time steps to produce
a 48 hour prediction.
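The grid and time-step counts quoted above can be checked with a few lines of arithmetic. The sketch uses only the figures from the text; the derived cost per time step is an illustrative average, not a measured value.

```python
# Figures quoted in the text for the LFM on the Cyber 205.
layers, nx, ny = 7, 53, 45
grid_points = layers * nx * ny            # points carrying prognostic variables

step_seconds = 400
steps_per_hour = 3600 // step_seconds     # nine steps per model hour
total_steps = steps_per_hour * 48         # steps in a 48 hour forecast

cpu_seconds = 75.0                        # quoted CPU time for the forecast
cpu_per_step = cpu_seconds / total_steps  # average CPU time per time step

print(grid_points, total_steps, round(cpu_per_step, 3))
```

The average works out to roughly 0.17 seconds of CPU time per time step over the whole grid.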

The conversion of the LFM computer code to the Cyber 205 was accomplished
in about 1.5 months by a skilled meteorologist/programmer. The 2.0 FORTRAN
compiler was used to produce a half precision version without resorting
to Q8 special calls. The data structure of the original version of the
model was changed extensively to take advantage of long vector lengths.
Minimal vectorization of the radiation and moist physics was achieved
with use of the vector WHERE statement.

Operational use of the Cyber 205 has shown that the system is certainly
reliable and capable of achieving vendor-advertised CPU speeds. With
this new resource the National Weather Service should be able to improve
most aspects of numerical weather prediction systems, including the
prediction of major precipitation events. With the increase in computing
power, the National Weather Service will be able to run operational
numerical guidance systems with improved analysis methods, improved
model physics and increased mathematical accuracy.
Ocean modelling on the CYBER 205 at GFDL
Michael D. Cox 


1. Introduction 

At the Geophysical Fluid Dynamics Laboratory, research is carried out for
the purpose of understanding various aspects of climate, such as its
variability, predictability, stability and sensitivity. The atmosphere and
oceans are modelled mathematically and their phenomenology studied by computer
simulation methods. The present paper will discuss the present state of the
art in the computer simulation of large scale oceans on the CYBER 205. While
atmospheric modelling differs in some aspects, the basic approach used is
similar.

The equations of the ocean model will be presented in the following 
section along with a short description of the numerical techniques used to find 
their solution. Section 3 will deal with computational considerations and a 
typical solution will be presented in section 4. 


2. Equations of the model 

The model presented here is the multilevel numerical model described in
Bryan (1969). The continuous equations will be given. A detailed description
of the finite difference formulation may be found in the above work. The
equations of motion are the Navier-Stokes equations written in spherical
coordinates and modified by the Boussinesq approximation. Let $m = \sec\phi$,
$n = \sin\phi$, $u = a\dot{\lambda}m^{-1}$ and $v = a\dot{\phi}$, where $a$ is the
radius of the earth, $\phi$ the latitude and $\lambda$ the longitude. It is
convenient to define the advection operator

    $\Gamma(\mu) = m a^{-1}\left[(u\mu)_{\lambda} + (v\mu m^{-1})_{\phi}\right] + (w\mu)_{z}$.    (1)

The equations of motion on a sphere are

    $u_t + \Gamma(u) - 2\Omega n v = -m a^{-1}(p/\rho_0)_{\lambda} + F^{\lambda}$,    (2)

    $v_t + \Gamma(v) + 2\Omega n u = -a^{-1}(p/\rho_0)_{\phi} + F^{\phi}$,    (3)

    $\Gamma(1) = 0$,    (4)

    $g\rho = -p_z$,    (5)

where $\rho_0$ is unity in cgs units. The conservation equations for the
temperature and salinity are

    $T_t + \Gamma(T) = F^{T}$    (6)

    $S_t + \Gamma(S) = F^{S}$    (7)

The terms in F contain effects of mixing as well as external driving forces.
The equation of state

    $\rho = \rho(T, S, z)$    (8)

is an empirically derived formula relating the local density of seawater to
temperature, salinity and depth.

The set of equations (1-8) is cast into finite difference form. The
prognostic equations (2, 3, 6, 7) are solved as an initial value problem, placing
all terms except the local time derivative on the right hand side and carrying
out timesteps to predict new values of velocity, temperature and salinity on a
prescribed mesh covering the model ocean domain. Given a certain configuration
of steady wind driving and differential surface heating (both entering through
the F terms), a statistical steady state is approached asymptotically in time.
Time scale analysis of Eqs. (6, 7) reveals that O(1000) years of integration is
needed to bring the sluggish abyssal layers of the ocean model into a steady
state.
3. Computational considerations

Let us consider a rectangular ocean basin model comparable in size to the
N. Atlantic Ocean. It extends 60° in longitude, 85° in latitude and
4000 meters in depth. It is desirable to cover this domain with a mesh fine
enough to resolve mesoscale (O(100 km)) eddies, which play an important role in
transporting various properties through the ocean. The minimum resolution
needed for this purpose is roughly 1/3 degree in latitude and somewhat
larger, say 0.4 degree, in longitude due to the convergence of meridians on the
globe. This results in a horizontal grid space of 150x195 points. Vertically,
18 levels are needed to resolve the scales of interest. This brings the total
to just over 1/2 million grid points for which Eqs. (1-8) must be evaluated each
timestep.

The longest timestep which can be used without incurring numerical 
instability is given by the Courant-Fr iedrichs-Lewy condition 

cAt/Ax<l (9) 


where c is the phase velocity of the fastest moving wave in the ocean. Since 
high speed external gravity waves have been filtered from this model by the 
condition w=0 at the surface, the fastest wave is that associated with the 
internal density gradients (internal gravity wave) which has a speed of roughly 
3 m/sec. The smallest $\Delta x$ occurs at the northern wall of the model due to 
convergence of meridians, and is about 20 km. The resultant $\Delta t$ is such 
that roughly 5000 timesteps are necessary to integrate one year. Therefore, 5 
million timesteps, or 2.5x10^12 grid point evaluations of Eqs.(1-8), are 
required to integrate this model to a steady state. Even the fastest modern 
day computers cannot accomplish this task in a reasonable time, although steady 
progress is being made. The former computer at GFDL, the Texas Instruments 
ASC, took 15 seconds to compute one time step on the above model. At this 
speed, 2.4 years of computing would be needed to reach a steady state solution. 
Clearly, compromises must be made in designing experiments which are achievable 
in a reasonable amount of computer time. This may involve reducing the domain 
size, or integrating for a shorter period, or both. (Interesting results may be 
obtained from an integration of O(10) years, particularly for the upper ocean 
where time scales of adjustment are relatively short.) The greater the 
computational speed which can be attained, the less severe the compromises must 
be. 
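
The cost estimate in this paragraph follows directly from the stability limit. The sketch below reproduces the arithmetic under the figures quoted in the text (3 m/sec wave speed, 20 km minimum spacing, 1000 years to steady state, 15 seconds per step on the ASC); the 365-day year and all variable names are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365.0 * 24.0 * 3600.0   # assumed 365-day year

c = 3.0             # m/s, internal gravity wave speed
dx_min = 20.0e3     # m, smallest grid spacing (northern wall)

# Stability limit c*dt/dx < 1 gives the longest usable timestep.
dt_max = dx_min / c
steps_per_year_min = SECONDS_PER_YEAR / dt_max   # lower bound on steps per year

steps_per_year = 5000        # rounded figure used in the text
years_to_steady = 1000       # O(1000) years for the abyssal layers
grid_points = 150 * 195 * 18

total_steps = steps_per_year * years_to_steady
total_evals = total_steps * grid_points

asc_seconds_per_step = 15.0
asc_years = total_steps * asc_seconds_per_step / SECONDS_PER_YEAR

print(f"dt_max = {dt_max:.0f} s, at least {steps_per_year_min:.0f} steps/year")
print(f"{total_steps:.1e} steps, {total_evals:.2e} grid point evaluations")
print(f"ASC wall time: {asc_years:.2f} years")
```

The stability limit alone requires somewhat fewer than 5000 steps per year, so the rounded value in the text leaves a small safety margin.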

In converting the ASC ocean model to the CYBER 205, the most fundamental 
alteration of the code had to do with the treatment of land masses. 
Previously, the computation was carried out only over ocean points by making 
the DO loop limits functions of the placement of land. The contiguity 
requirement of the 205 for vectorization allows only the innermost of the three 
dimensional loops to vectorize in this case. An alternative method of handling 
land is to compute all points as if they were ocean and, at the end of the 
timestep, restore the land to its specified value using a masking array. 
Contiguity is then satisfied and vectorization is enabled through two 
dimensions. (The third dimension cannot be vectorized because it is cycled 
through memory from disc.) By using the latter technique, the typical vector 
length in the computation is increased from 150 in the example above (east-west 
dimension) to 2700 (east-west times depth dimension) resulting in a 
considerable decrease in the relative time spent in vector startup. 
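
The masking idea can be illustrated with a small sketch. It is not the model's FORTRAN: the functions, the toy update and the land layout below are hypothetical, and plain Python lists stand in for the long contiguous vectors of the 205.

```python
def step_ocean_segments(field, is_ocean, update):
    """Reference: update ocean points only, leaving land untouched
    (the analogue of DO loop limits that depend on where the land is)."""
    return [update(v) if ocean else v for v, ocean in zip(field, is_ocean)]


def step_with_mask(field, is_ocean, land_value, update):
    """Masking approach: update every point as if it were ocean in one
    contiguous sweep, then restore land points to their specified value."""
    updated = [update(v) for v in field]
    return [u if ocean else land_value for u, ocean in zip(updated, is_ocean)]


# Toy example: 150 east-west points times 18 levels, flattened to one vector.
nx, nz = 150, 18
land_value = 0.0
is_ocean = [(i % nx) >= 10 for i in range(nx * nz)]   # hypothetical land strip
field = [1.0 if ocean else land_value for ocean in is_ocean]

new_a = step_ocean_segments(field, is_ocean, lambda v: 2.0 * v + 1.0)
new_b = step_with_mask(field, is_ocean, land_value, lambda v: 2.0 * v + 1.0)
print(len(field), new_a == new_b)
```

Both routes give the same answer; the second does some wasted work on land points but operates on a single vector of length 2700 instead of many short segments.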

An additional time saving has been accomplished in an area of the code 
which is used heavily, but is inherently unvectorizable due to a recursive 
property. Using Q8 calls to insert machine language directly into the FORTRAN, 
CDC personnel have "unrolled" this loop, greatly improving on the code 
generated by the compiler for the equivalent FORTRAN loop. 

The use of half-precision on all floating point variables has resulted in 
a gain of only about 15% in overall running speed, although sections of the 
code which are 100% vectorized increase in speed by roughly 40%. Additional 
work is needed to determine why the overall gain is so small considering the 
high degree of vectorization of the code. 

Since the model above is too large to fit into core memory entirely, data 
is cycled through memory from disc as it is needed each timestep. If this disc 
transfer cannot be buffered sufficiently well, computation ceases while waiting 
for the I/O to finish. The result is that the computer may not be used 
efficiently, particularly if the other jobs running concurrently have the same 
difficulty. Until recently, this was a severe problem on the 205. The above 
model, when in the 205 alone, ran only about 15% of the wall clock time. 
Improved I/O schemes have been developed by CDC personnel at GFDL and currently 
the same model runs about 80% of the wall clock time when alone. This compares 
favorably with I/O efficiencies on the ASC. 

The CYBER 205 version of the model described above currently takes 4 
seconds to compute one timestep, almost a factor of 4 faster than the ASC. 
While this speed still does not make the experiment proposed at the beginning 
of this section feasible, the compromises which are necessary to produce an 
attainable solution are much less severe than before. One such experiment will 
be described in the following section. 

4. An ocean simulation experiment 

If one wishes to study the effects of topography on the dynamics of the 
Gulf Stream, an argument can be made that it is not necessary to consider a 
domain as large as the one proposed earlier, and that several decades of 
integration are sufficient. Therefore, let us reduce the domain from 65 to 27 
degrees in latitude and from 60 to 32 degrees in longitude. Also, for this 
purpose, the vertical resolution may be decreased from 18 layers to 5 layers. 
This produces a model which takes approximately one hour of 205 time to 
integrate one year of ocean time. Applying surface wind stress and differential 
heating similar to that of the N. Atlantic, this model has been integrated from 
rest for a total of 20 years. The resulting temperature pattern at the second 
layer, centered at 212 meters depth, is shown in Fig. 1. The land mass in the 
northwest corner simulates the gross features of the U.S. east coast. A 
continental shelf and slope are also included in this solution. The simulated Gulf 
Stream is revealed by the tightly packed isotherms along the coast and bending 
out to sea at the point representing Cape Hatteras. In agreement with 
observations, there exist both cold and warm core "rings" which have broken 
from the Stream and are drifting westward. An example of the former is 
centered at about 70°W, 30°N and of the latter at 68°W, 37°N. 

Three other experiments have been carried out in this series, altering the 
topography along the western boundary to study its effect on the path and 
behavior of the Gulf Stream. 

1. INTRODUCTION 

Numerical Weather Prediction (NWP), for both operational and research purposes, 
requires not only fast computational speed but also large memory. In this paper I will 
discuss a technique for solving the Primitive Equations for atmospheric motion on the 
CYBER 205, as implemented in the Mesoscale Atmospheric Simulation System (MASS) 
(Kaplan et al., 1982), which is fully vectorized and requires substantially less memory 
than other techniques such as the Leapfrog or Adams-Bashforth Schemes. The technique 
to be presented uses the Euler-Backward time marching scheme. 

Also to be discussed will be several techniques for reducing the CPU time of the 
model by replacing "slow" intrinsic routines by faster algorithms which use only hardware 
vector instructions. 



2. MODEL BACKGROUND 


2.1 Description 

MASS is a hydrostatic primitive equation model which is run over a limited 
area. The model forecasts the 3-dimensional structure of wind, pressure, 
temperature and moisture. The actual domain of coverage, along with the 
horizontal distribution of grid points, is depicted in Fig. 1. The characteristics of 
the model are listed in Table 1. 

2.2 Uses and Support 

The model has been applied primarily to the problem of forecasting the 
atmospheric environment within which severe local storms (severe thunderstorms 
and tornadoes) are likely to develop. It has also been applied to the problems of 
forecasting and investigating east coast cyclogenesis, upper level turbulence and 
shear, and boundary layer transport. Support for the model development has been 
provided by NASA/Goddard using the computational facilities of NASA/Langley 
(CYBER 203) and NASA/Goddard (CYBER 205). 

2.3 History 

The original version was implemented on a 500K word CDC STAR 100 Vector 
Processor at NASA/Langley in the late 70's using 64-bit FORTRAN. The 
availability of the SL/1 programming language at Langley, which permitted easy 
access to the 32-bit instruction set on the STAR 100, resulted in an effective 
doubling of the memory and the model was recoded with larger vectors. This 
allowed for an increase in the area over which the model was run while maintaining 
the same horizontal and vertical resolution. 
Fig. 1 Domain of coverage by MASS model and the horizontal distribution of grid points 




Table 1 Characteristics of MASS model 

MASS (Description) 

o Hydrostatic Primitive Equations 
o Terrain Following Sigma-P Coordinate 
o Limited Area Domain 
o Cartesian Grid on a Polar Stereographic Map (Arakawa "A" Grid) 
o 4th Order Accurate Horizontal Space Differencing 
o 2nd Order Accurate Vertical Space Differencing 
o 2nd Order Accurate Time Differencing 
o Initial Data is derived from the LFM Analysis plus Rawinsondes 
o Initialization is based on the Calculus of Variations 
o Physics 
  - Large Scale Precipitation 
  - Planetary Boundary Layer 
  - Dry Convection 
  - Moist Convection (Under Development) 
o 50 Km Grid Spacing at *6°N 
o 19 Equally Spaced Layers 
o 128 X 96 Computational Domain 
o Time Dependent Boundary Conditions 
o Comprehensive Interactive Diagnostic Package on the Front End 
  - Vertical Profiles 
  - Vertical Cross Sections 
  - Constant Pressure Surfaces 
  - Time History 
  - Trajectories 
  - Verification Statistics 
In the spring of 1980, the STAR 100 was upgraded to a 1M word CDC CYBER 
203. The new machine effectively had twice the memory of the STAR 100. The 
area over which the model is run was again expanded and the vertical resolution 
was increased from 12 to 14 vertical layers. 

In the spring of 1983, the model was transferred to the NASA/Goddard 
CYBER 205. The model was recoded in CDC FORTRAN 2.0 using 32-bit 
arithmetic. After being successfully benchmarked against the Langley version, the 
vertical resolution was again increased from 14 to 19 layers. The Goddard version 
of MASS on the CYBER 205 executes approximately 3 times faster than the 
Langley version on the CYBER 203. This can be explained by 

1) Reduction in cycle time from 40 to 20 ns. 

2) Linked triad instruction on the CYBER 205. 

3) Faster gather/scatter instruction. 

4) Coding differences. 


3. EQUATION SET 

The model utilizes a standard primitive equation set cast in a terrain following $\sigma_p$ 
coordinate system. As indicated earlier, the forecasted variables are the 3-D 
distribution of wind, pressure, temperature and moisture. The basic prognostic equations 
are given below where u and v are x and y coordinate momentum, T is temperature, q is 
the moisture mixing ratio and $\pi$ is the pressure at the terrain minus the pressure at the 
top of the model. 

Three diagnostic equations close the system and are given below where $\dot\sigma$ is the 
vertical velocity, $\phi$ is the geopotential energy and $\omega$ is the vertical velocity in pressure 
coordinates. 



The boundary conditions are 

cr, = <h - o 


‘P'/i = ~r &n 






Bu>ur 


and the definitions for $\sigma$ and $\pi$ are 

$$\sigma = \frac{p - p_{top}}{\pi}, \qquad \pi = p_{sur} - p_{top}$$

the remaining variables are 
m = mapscale grid transformation factor 
$c_p$ = specific heat at constant pressure 
R = gas constant for dry air 
$p_{sur}$ = pressure at the terrain 
$p_{top}$ = pressure at the top of the model 
horizontal eddy diffusivity 


4. GRID SYSTEM 

The technique for solving the differential equations is to discretize the equations 
into finite difference form and solve them on a 3-D grid. The horizontal grid employed is 
the Arakawa "A" grid where all dependent variables are defined at all grid points. The 
vertical grid is staggered so that u, v, T and q represent layer averages defined at the 
mid-point of each layer and $\dot\sigma$ and $\phi$ are held at the layer interfaces. The third diagnostic 
variable, $\omega$, is held with u, v, T and q. This structure is represented in Fig. 2. 


5. NUMERICAL TECHNIQUE 


5.1 Horizontal Space Derivatives 

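The fourth order accurate finite difference approximation to an x-direction 
space derivative for an arbitrary variable $\psi$ is given below 

$$\left(\frac{\partial \psi}{\partial x}\right)_i \approx \frac{8(\psi_{i+1} - \psi_{i-1}) - (\psi_{i+2} - \psi_{i-2})}{12\,\Delta x}$$

where i is a horizontal index. An analogous formula is used for y-direction 
derivatives. 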
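
As an illustration, the standard five-point centered stencil that achieves fourth order accuracy on an unstaggered grid can be written as below. This is a sketch only: the function name is hypothetical, boundary points are simply left at zero, and the model itself applies the operation as vector instructions over the whole horizontal grid rather than in a scalar loop.

```python
def ddx_fourth_order(psi, dx):
    """Fourth order centered estimate of d(psi)/dx at interior points.

    Uses the five-point stencil
        [8*(psi[i+1] - psi[i-1]) - (psi[i+2] - psi[i-2])] / (12*dx).
    The two points nearest each end are left at 0.0 in this sketch.
    """
    n = len(psi)
    out = [0.0] * n
    for i in range(2, n - 2):
        out[i] = (8.0 * (psi[i + 1] - psi[i - 1])
                  - (psi[i + 2] - psi[i - 2])) / (12.0 * dx)
    return out


# The stencil differentiates polynomials up to degree four without
# truncation error, so a cubic makes a convenient check.
dx = 0.5
xs = [i * dx for i in range(8)]
cubic = [x ** 3 for x in xs]
slope = ddx_fourth_order(cubic, dx)
print(slope[3], 3.0 * xs[3] ** 2)
```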
Fig. 2 Vertical grid system of MASS 

5.2 Vertical Space Derivatives 


A second order accurate finite difference formula is used to approximate the 
vertical advection terms of the u, v, T and q prognostic equations. The 
representation, for an arbitrary variable $\psi$, is given below 

- 1 

where k is a vertical index. 

5.3 Time Derivatives 

A second order accurate approximation to the time derivatives is used. The 
Euler-Backward Technique has the properties of frequency dependent damping and 
no computational mode. For an arbitrary variable $\psi$ the finite difference 
representation is given as 

Prediction:  $\psi^{*} = \psi^{n} + \Delta t \left(\frac{\partial \psi}{\partial t}\right)^{n}$

Correction:  $\psi^{n+1} = \psi^{n} + \Delta t \left(\frac{\partial \psi}{\partial t}\right)^{*}$

where n is a time level index and * refers to an intermediate time level. 
This scheme requires the storage of only one time level of information (time 
level n) whereas other explicit schemes such as the Leapfrog Scheme require the 
storage of at least two time levels (n and n-1). The penalty is that twice the 
computational work is required as compared with the Leapfrog scheme. 
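
The two-stage structure and the damping property can be seen on a toy problem. The sketch below applies one Euler-Backward (prediction, then correction) step to the oscillation equation d(psi)/dt = i*omega*psi; the function name and the test values are illustrative assumptions, not taken from MASS.

```python
def euler_backward_step(psi, tendency, dt):
    """Advance psi by one timestep with the Euler-Backward scheme.

    Prediction: psi_star = psi + dt * tendency(psi)
    Correction: psi_new  = psi + dt * tendency(psi_star)
    Only psi at level n and the intermediate psi_star are held.
    """
    psi_star = psi + dt * tendency(psi)       # prediction (n -> *)
    return psi + dt * tendency(psi_star)      # correction (* -> n+1)


def amplification(omega, dt):
    """Modulus of psi(n+1)/psi(n) for the oscillation equation."""
    psi_new = euler_backward_step(1.0 + 0.0j, lambda p: 1j * omega * p, dt)
    return abs(psi_new)


# Higher frequencies (larger omega*dt) are damped more strongly.
for p in (0.1, 0.5, 0.9):
    print(p, amplification(omega=1.0, dt=p))
```

For this test equation the amplification factor works out to sqrt(1 - p^2 + p^4) with p = omega*dt, which is below one for 0 < p < 1 and is the frequency dependent damping referred to above.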


6. BASIC MEMORY REQUIREMENTS 

As mentioned earlier, the Euler-Backward scheme for time marching the 
prognostic equations for the 3-D structure of wind, pressure, temperature and moisture 
requires the storage of only one time level of information. The *'ed time level is an 
intermediate time level and only needs to be as deep (with respect to the vertical) as is 
required to solve the equations at a layer. It should be noted that only the vertical 
advection terms couple the model layers together and that to solve the equations at layer 
k requires the dependent variables at layers k+1, k and k-1. Therefore, the *'ed time 
level only needs to be 3 deep (it holds the prediction values to be used during the 
correction step) and can be reused for the solution of each layer. 

Given that the 19 model layers contain 128 x 96 grid points each, the basic 
memory required is 

u      (128, 96, 19) 
v      (128, 96, 19) 
T      (128, 96, 19) 
q      (128, 96, 19) 
pi     (128, 96) 
ustar  (128, 96, 3) 
vstar  (128, 96, 3) 
tstar  (128, 96, 3) 
qstar  (128, 96, 3) 
pistar (128, 96) 

If an additional layer were to be added only the u, v, T and q arrays would be 
increased. The ustar, vstar, tstar and qstar arrays are always dimensioned 3 deep; 
this is a consequence of the vertical advection terms, which require 3 layers of storage to 
solve the equations. 

In contrast, the Leapfrog scheme would require 2 sets of arrays dimensioned 128 x 
96 x 19. There is therefore a considerable memory saving with the Euler-Backward 
Scheme. A technique developed by Tuccillo (1983) shows some promise in reducing the 
computational work by increasing the permissible timestep. 
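
The saving can be put in numbers by counting array elements. The sketch below simply totals the dimensions listed above and compares them with two full time levels, which is what the text says the Leapfrog scheme would need; workspace and physics arrays are ignored, and the variable names are illustrative.

```python
NX, NY, NZ = 128, 96, 19
layer = NX * NY                      # points in one horizontal layer

# Euler-Backward: one full time level plus 3-deep starred work arrays.
full_level = 4 * layer * NZ + layer          # u, v, T, q and pi
starred = 4 * layer * 3 + layer              # ustar..qstar and pistar
euler_backward_words = full_level + starred

# Leapfrog: two complete time levels (n and n-1).
leapfrog_words = 2 * full_level

print(euler_backward_words, leapfrog_words,
      round(euler_backward_words / leapfrog_words, 2))
```

Under these assumptions the Euler-Backward layout needs a little under 60% of the storage of the two-level layout, and the gap widens as layers are added because the starred arrays stay 3 deep.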


7. METHOD OF SOLUTION 

The method of solution is depicted in Fig. 3 and shows the sequence of steps 
required to solve the equations at all layers. Prediction is the step that advances the 
solution from the n to the * time level and correction is the step that advances the 
solution from the * to the n+1 time level. If there are NZ layers then there are 2*NZ 
steps required to advance the solution one time step. The number above each 
line represents the order of solution where the first step is to perform prediction for 
layer 1, the second step is prediction at layer 2, the third step is correction at layer 1 
and so on. After correction at layer NZ (step 2*NZ) is finished the solution has been 
advanced one time step. 
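
The ordering described above can be generated programmatically. The following sketch is an interpretation of the text and of Fig. 3, not the model's code: correction at layer k is issued as soon as prediction at layer k+1 is available, since the vertical advection terms need starred values at k-1, k and k+1.

```python
def solution_order(nz):
    """Return the 2*nz (step kind, layer) pairs for one timestep."""
    steps = [("prediction", 1)]
    for k in range(2, nz + 1):
        steps.append(("prediction", k))       # starred values for layer k
        steps.append(("correction", k - 1))   # layer k-1 now has k-2, k-1, k
    steps.append(("correction", nz))          # top layer has no layer above
    return steps


order = solution_order(19)
for number, (kind, layer) in enumerate(order[:5], start=1):
    print(number, kind, layer)
print(len(order), order[-1])
```

With this ordering prediction at layer NZ is step 2*NZ-2 and correction at layers NZ-1 and NZ are steps 2*NZ-1 and 2*NZ, matching the step numbers shown in the figure, and no more than three layers of starred values are live at any time.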
Fig. 3 Sequence of steps to advance the solution one timestep (vertical axis: layer, 1 to NZ; prediction steps advance from n to *, correction steps from * to n+1; prediction at layers NZ-1 and NZ is step 2*NZ-4 and 2*NZ-2, correction at layers NZ-2, NZ-1 and NZ is step 2*NZ-3, 2*NZ-1 and 2*NZ) 
The *'ed arrays are reused for each layer and the calculations for each layer are 
fully vectorized where the vector lengths are NX*NY or 12288. For this vector length 
the machine is computing at about 98% of its maximum rate. 


8. BOUNDARY CONDITIONS 

Since MASS is a limited area model, as opposed to a global model, the solution at 
the horizontal boundaries needs to be specified. The technique for specifying the 
boundary conditions consists of blending externally calculated values using a weighted 
average formula which is represented by 

$$\frac{\partial \psi}{\partial t} = W \left(\frac{\partial \psi}{\partial t}\right)_{INTERIOR} + (1 - W) \left(\frac{\partial \psi}{\partial t}\right)_{EXTERIOR}$$

where W = 0     on outer column and row 
      W = 0.333 on first column and row in 
      W = 0.666 on second column and row in 
      W = 1.0   on third column and row in 


It should be pointed out that this technique produces an overspecification at the 
boundary and higher horizontal diffusion is required near the boundaries to control noise 
generation. 

This technique is vectorized by holding the externally specified boundary 
tendencies in a vector and using the scatter instruction to expand them into the correct 
positions prior to computing the weighted average. This technique minimizes the amount 
of storage required. 
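
A minimal sketch of the blending is given below. It assumes that the weight depends only on how many rows or columns a point lies in from the nearest edge and that it equals 1 everywhere further inside; the function names are hypothetical, exact thirds are used where the text quotes 0.333 and 0.666, and the scatter of a compact boundary vector is not reproduced.

```python
def blend_weight(i, j, nx, ny):
    """Weight W for grid point (i, j): 0 on the outer row/column,
    rising in thirds to 1 on the third row/column in and beyond."""
    depth = min(i, j, nx - 1 - i, ny - 1 - j)   # rows/columns in from the edge
    return min(depth, 3) / 3.0


def blended_tendency(interior, exterior, w):
    """Weighted average of model (interior) and external tendencies."""
    return w * interior + (1.0 - w) * exterior


nx, ny = 128, 96
for depth in range(5):
    w = blend_weight(depth, ny // 2, nx, ny)
    print(depth, round(w, 3), blended_tendency(2.0, 8.0, w))
```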
9. PROGRAMMING TECHNIQUES 

The code is completely vectorized in the horizontal. The average vector length 
is about 12000 which represents the number of horizontal grid points. There is a loop 
over the vertical layers. 

Some specific techniques used during the coding are 

o 32-bit arithmetic 

Sensitivity tests have indicated that 32-bits provides enough precision. 
Using 32-bits effectively doubles the real memory and halves the execution 
time. 

o Explicitly Vectorized 

The code does not depend on automatic vectorization by the compiler. 
All descriptors are set up with DATA and ASSIGN statements. Special Q8 
calls are used where required. 

o Dyadic and Triadic Structure 

All vector statements are written in a dyadic structure (triadic when 
linked triads are created) to minimize compiler generated dynamic space 
which may cause paging. 

o Subroutines are kept small enough so that the Register File is not 
overflowed. 

Subroutines which have more local variables than the size of the 
register file (approximately 200) can be inefficient since loads from 
memory must be executed. All subroutines are kept small enough so that 
the swap instruction can load all necessary local variables at entry. 

o Parameter Statement Used for Vector Dimensions 

Vector dimensions are easily changed by changing parameter values. 

o Factoring of Equations to yield Linked Triads 

The sequence of instructions has been arranged to yield the maximum 
number of linked triads. 

o Run Only in Real Memory 

No page faults are generated during the iterative time marching. 

o Vectors are Grouped on Large Pages 

All large vectors are placed in common and grouped on large pages 
using loader options. 

o Bit Vectors vs. Gather/Scatter 

For those situations where control store or gather/scatter can be 
applied, an analysis using the nominal performance figures for each 
instruction was performed and the most CPU or memory efficient 
technique was applied. 

10. TECHNIQUES FOR REDUCING CPU TIME 

A 24-hour simulation with the model requires 1312 timesteps. Each timestep 
requires the evaluation of 2*NZ natural logs (for 12288 grid points). This required 
approximately 22 minutes of CPU time using the 32-bit FORTRAN VHALOG function. 
Since the range of arguments for the natural log function was known, a more efficient 
technique was incorporated where the natural log was approximated with a series 
factored using Horner's Rule. The evaluation requires 11 vector instructions, nine of 
which are linked triads, and runs approximately 40 times faster than the FORTRAN 
intrinsic function. This technique reduced the CPU time spent evaluating natural logs to 
30 seconds. 
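
The text does not give the series that was used, so the following is only a sketch of the general idea under assumed choices: the classical expansion ln(x) = 2(z + z^3/3 + z^5/5 + ...) with z = (x-1)/(x+1), truncated after nine terms and evaluated with Horner's Rule. Each Horner step has the form (vector * vector + scalar), which is the kind of operation that maps onto a linked triad; here it is written for a single scalar argument.

```python
import math

# Coefficients 1, 1/3, 1/5, ..., 1/17 of the series in z**2 (assumed choice).
_COEFFS = [1.0 / (2 * k + 1) for k in range(9)]


def log_series(x):
    """Approximate ln(x) for x near 1 (roughly 0.5 <= x <= 2)."""
    z = (x - 1.0) / (x + 1.0)
    z2 = z * z
    acc = _COEFFS[-1]
    for c in reversed(_COEFFS[:-1]):
        acc = acc * z2 + c          # one multiply-add per Horner step
    return 2.0 * z * acc


for x in (0.5, 0.9, 1.0, 1.5, 2.0):
    print(x, log_series(x), math.log(x))
```

Knowing the argument range in advance is what makes a fixed, short polynomial sufficient; a general purpose intrinsic must handle every possible argument.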

Other techniques for reducing CPU time consist of approximating the ** 
FORTRAN function with a series of square roots (square root is a hardware instruction) 
and inverting scalars to generate vector multiplies instead of vector divides. 
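
Both ideas are easy to illustrate. The exponent below (3/4) is a hypothetical example, since the text does not say which powers occur in the model; any exponent that is a sum of negative powers of two can be built the same way from repeated square roots.

```python
import math


def pow_three_quarters(x):
    """x**0.75 from two square roots: x**(1/2) * x**(1/4)."""
    root = math.sqrt(x)               # x**0.5
    return root * math.sqrt(root)     # x**0.5 * x**0.25


def scale_by_reciprocal(values, divisor):
    """Divide every element by a scalar using one divide and n multiplies."""
    inverse = 1.0 / divisor           # single scalar divide
    return [v * inverse for v in values]


print(pow_three_quarters(16.0), 16.0 ** 0.75)
print(scale_by_reciprocal([2.0, 4.0, 6.0], 2.0))
```

The reciprocal form can differ from a true divide in the last bit for some divisors, which is normally acceptable at the precision discussed here.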

The version of MASS implemented on the CYBER 205 at NASA/Goddard requires 
13 large pages of memory and 15 minutes of CPU time (same as wall time) for a 24 hour 
simulation over the area depicted in Fig. 1. 


11. EXAMPLE OF OUTPUT 

MASS at Goddard features a comprehensive postprocessing system to produce 
output from the model for interpretation. The postprocessing system runs interactively 
and produces hard copies on a GOULD electrostatic plotter. Future versions of 
the postprocessing system will likely feature interactive color graphics which should 
greatly improve the usability of the modeling system as a research tool for studying 
atmospheric processes. Figs. 4-12 are examples of the output from three of the six 
postprocessing programs currently available. 
MASS forecasted 500 mb temperatures (degrees Celsius) 

MASS forecasted 500 mb heights (meters) and vorticity (per second) 

MASS forecasted 500 mb wind vectors and isotachs (meters per second) 

Fig. 11 Sounding locator map 

Fig. 12 MASS forecasted sounding (2100 GMT 04/02/82) 
Introduction. Significant advances are being made in the theoretical treatment of the conformation 
and dynamics of biological molecules. Several recent convergent developments are responsible for 
opening up new fields of investigation. They include: 

1. The development and application of powerful theoretical techniques taken from statistical physics 
such as Monte Carlo and molecular dynamics simulations to biological systems. 

2. The development of powerful computational hardware such as the Cyber 205. 

3. The development of interactive graphics systems. 

4. The increasing availability of experimental structural and dynamic data such as the ever-growing 
data base of protein crystal structures, small peptide crystal structures and the structural and 
dynamic properties of these same molecules in solution. 

These developments enabled us to undertake the project of studying ligand binding to 
dihydrofolate reductase (DHFR). This is an extremely important enzyme, as it is the target of several drugs 
(inhibitors) which are used clinically as antibacterials, antiprotozoals and in cancer chemotherapy.$^{1,2}$ 
DHFR catalyzes the NADPH (reduced nicotinamide adenine dinucleotide phosphate) dependent 
reduction of dihydrofolate to tetrahydrofolate, which is used in several pathways of purine and pyrimidine 
biosynthesis, including that of thymidylate.$^{3}$ Since DNA synthesis is dependent on a continuing supply 
of thymidylate, a blockade of DHFR resulting in a depletion of thymidylate can lead to the cessation of 
growth of a rapidly proliferating cell line. 

DHFR exhibits a significant species to species variability in its sensitivity to various inhibitors. 
For example, trimethoprim, an inhibitor of DHFR, binds to bacterial DHFRs 5 orders of magnitude 
more strongly than to vertebrate DHFRs.$^{4,5}$ We were interested in studying the structural mechanics, 
dynamics and energetics of a family of dihydrofolate reductases to rationalise the basis for the inhibition of 
these enzymes and to understand the molecular basis of the difference in the binding constants between 
the species. This involves investigating the conformational changes induced in the protein on binding 
the ligand, the internal strain imposed by the enzyme on the ligand, the restriction of fluctuations in 
atom positions due to binding and the consequent change in entropy. X-ray crystallographic structures 
of DHFR from a few species, in complex with various ligands, are known,$^{6-8}$ as well as partial data 
about the structures in solution.$^{9-11}$ The availability of the structure, in the form of atomic coordinates 
for the enzyme system, is a prerequisite for performing any kind of energy calculations. In addition, 
due to the size of these systems as discussed below, only the availability of supercomputers such as the 
Cyber 205 makes this project feasible. 

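Computational Techniques. The techniques we use to investigate the DHFR system all require the 
calculation of the potential energy of the molecular system. This potential energy is expressed in terms 
of an analytical representation of all internal degrees of freedom and interatomic distances, as in eqn. (1). 

$$
\begin{aligned}
V ={}& \sum \left\{ D_b \left[ 1 - e^{-\alpha (b - b_0)} \right]^2 - D_b \right\} + \tfrac{1}{2} \sum H_\theta (\theta - \theta_0)^2 \\
&+ \tfrac{1}{2} \sum H_\phi (1 + s \cos n\phi) + \tfrac{1}{2} \sum H_\chi \chi^2 \\
&+ \sum\sum F_{bb'} (b - b_0)(b' - b_0') \\
&+ \sum\sum F_{\theta\theta'} (\theta - \theta_0)(\theta' - \theta_0') + \sum\sum F_{b\theta} (b - b_0)(\theta - \theta_0) \\
&+ \sum F_{\phi\theta\theta'} \cos\phi \, (\theta - \theta_0)(\theta' - \theta_0') + \sum\sum F_{\chi\chi'} \chi\chi' \\
&+ \sum \epsilon \left[ 2 (r^*/r)^9 - 3 (r^*/r)^6 \right] + \sum q_i q_j / r \qquad (1)
\end{aligned}
$$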

This type of representation of the potential energy in terms of the internal (valence) degrees of 
freedom is called a Valence Force Field. Such valence force fields have long been used in vibrational 
spectroscopy in order to carry out normal mode analysis.$^{12}$ Basically the terms in equation (1) express 
the energies required to deform each internal coordinate from some unperturbed "standard" value 
denoted by the subscript "0". The first term is a Morse potential which describes the energy required to 
stretch each bond from its relaxed value, $b_0$. The second term represents the energy stored in each 
valence angle when it is bent from its "standard" value, $\theta_0$. The third term represents the intrinsic 
energy required to twist the molecule about a bond by a torsion angle, $\phi$. The fourth term represents 
the energy required to distort intrinsically planar systems by $\chi$ from their planar conformation, i.e. the 
out of plane term. The next terms represent various couplings between internal coordinates, which are 
known to be necessary from studies of vibrational spectra.$^{13}$ They are the bond-bond, angle-angle, 
bond-angle, angle-angle-torsion and out of plane cross-terms respectively. The last 3 terms describe the 
exchange repulsion, dispersion and coulombic interactions that occur between non-bonded atoms. 

The parameters $D_b$, $H_\theta$, $H_\phi$, $H_\chi$ and $F$ are the force constants for the corresponding 
intramolecular deformations, $r^*$ and $\epsilon$ characterize the size of the atoms and the strength of the van der 
Waals interaction between them, while the $q$ are the partial charges carried by each atom. The 
parameters for the functions were derived from fitting a wide range of experimental data including crystal 
structure, unit cell vectors and the orientation of the asymmetric unit, sublimation energies, molecular 
dipole moments, molecular structure, vibrational spectra and strain energies of small organic 
compounds.$^{14-19}$ Ab-initio molecular orbital calculations have also been used in conjunction with the 
experimental data to give information on charge distributions, energy barriers and coupling terms, both 
to supplement and confirm the results obtained from the experimental data.$^{20,21}$ 
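
To make the functional forms concrete, the sketch below evaluates a few of the individual terms of eqn. (1) for a single bond, angle or atom pair. The function names and all numerical parameter values are hypothetical and chosen only for illustration; unit conversion factors for the coulombic term are omitted, as they are in the equation.

```python
import math


def morse_bond(b, d_b, alpha, b0):
    """Bond stretch term: D_b * [1 - exp(-alpha*(b - b0))]**2 - D_b."""
    return d_b * (1.0 - math.exp(-alpha * (b - b0))) ** 2 - d_b


def angle_bend(theta, h_theta, theta0):
    """Valence angle term: (1/2) * H_theta * (theta - theta0)**2."""
    return 0.5 * h_theta * (theta - theta0) ** 2


def nonbond_9_6(r, eps, r_star):
    """Repulsion and dispersion: eps * [2*(r*/r)**9 - 3*(r*/r)**6]."""
    ratio = r_star / r
    return eps * (2.0 * ratio ** 9 - 3.0 * ratio ** 6)


def coulomb(r, q_i, q_j):
    """Coulombic term q_i*q_j/r (no unit conversion factor applied)."""
    return q_i * q_j / r


# Hypothetical parameters: the 9-6 term has its minimum, -eps, at r = r*.
eps, r_star = 0.1, 3.5
for r in (3.0, 3.5, 4.0, 6.0):
    print(r, round(nonbond_9_6(r, eps, r_star), 5))
print(morse_bond(1.5, d_b=100.0, alpha=2.0, b0=1.5))
```

In the full calculation each of these functions is summed over every internal coordinate or non-bonded pair of the system.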

Minimisation. Given the analytical representation of the potential energy in eqn. (1), we can 
minimize this energy with respect to all internal degrees of freedom, i.e. solve the equation 

$$\partial E / \partial x_i = 0, \qquad i = 1, 3n \qquad (2)$$

where E is the potential energy V of eqn. (1) and the $x_i$ are the cartesian coordinates of the molecule. 

The minimisation results in the "minimum energy structure" of the system. Analysis of the minimum 
energy structure reveals the basic structural features of the system along with the interatomic forces 
underlying this minimum energy conformation. At the minimum, we can take second derivatives of 
the energy and construct the mass weighted second derivative matrix. From the eigenvalues of this 
matrix the vibrational frequencies may be obtained and the normal modes from the eigenvectors.$^{22}$ The 
conformational entropy of the system can now be calculated from the vibrational frequencies using the 
Einstein relations.$^{23}$ The conformational entropy of a system plays an important role in both 
conformational equilibria and binding.$^{24}$ 

Molecular dynamics. Molecular dynamics is the numerical integration of Newton's classical 
equations of motion. Having specified the potential, we define the initial conditions of the system, the 
coordinates of the protein, inhibitor, solvent and a set of initial velocities. Once the initial conditions are 
given, Newton's equations of motion 

$$-\,\partial V(\vec r_1 \ldots \vec r_n) / \partial \vec r_i = \vec F_i(\vec r_1 \ldots \vec r_n) = m_i \, d^2 \vec r_i / dt^2 \qquad (3)$$

are integrated forward in time, in order to compute the atomic trajectories $\vec r_1(t) \ldots \vec r_n(t)$ as functions of 
time. The forces are calculated from the energy expression in eqn. (1) by taking analytical derivatives. 
We then take a small time step, $\Delta t$, of $\approx 10^{-15}$ sec. and, applying the acceleration as calculated from 
Newton's law (eqn. 3), we update the velocity and position of each atom to a new velocity and position 
using a Gear$^{25}$ predictor-corrector algorithm or a Verlet algorithm.$^{26}$ The forces and acceleration at the 
new positions are then calculated and we repeat the procedure, thus tracing the trajectories of the 
atoms. 
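
The integration loop can be sketched with the position form of the Verlet algorithm applied to a one-dimensional harmonic oscillator. This is an illustrative stand-in only: the real calculation uses the forces from eqn. (1) for thousands of atoms, and the function name, the start-up step and the test problem are assumptions made here.

```python
import math


def verlet_trajectory(x0, v0, accel, dt, n_steps):
    """Integrate d2x/dt2 = accel(x) with the position Verlet recurrence
        x(n+1) = 2*x(n) - x(n-1) + accel(x(n)) * dt**2.
    Returns the positions at t = 0, dt, ..., n_steps*dt."""
    x_prev = x0
    x = x0 + v0 * dt + 0.5 * accel(x0) * dt * dt   # Taylor start-up step
    traj = [x_prev, x]
    for _ in range(n_steps - 1):
        x_next = 2.0 * x - x_prev + accel(x) * dt * dt
        x_prev, x = x, x_next
        traj.append(x)
    return traj


# Unit mass and force constant: the exact solution is x(t) = cos(t).
dt = 0.01
traj = verlet_trajectory(x0=1.0, v0=0.0, accel=lambda x: -x, dt=dt, n_steps=628)
print(traj[-1], math.cos(628 * dt))
```

The timestep must be small compared with the fastest motion in the system, which is why a step of about 1 femtosecond is used for molecules.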

Calculations on the Cyber. One of the systems we are studying, the E. coli DHFR-Trimethoprim 
complex, is the system we have been using to develop the programs on the Cyber 205. Table I lists the 
number of atoms, internal coordinates and non-bond interactions for this system, to demonstrate the 
magnitude of the calculation involved. 


Table I 

E. coli Dihydrofolate Reductase System 

Atoms 
  E. coli Dihydrofolate Reductase         2490 
  Trimethoprim                              40 
  155 Waters                               465 
  Total                                   2995 

Internal Coordinates 
  Bonds                                   2875 
  Valence Angles                          4785 
  Torsion Angles                          6784 
  Bond-Bond cross-terms                   4785 
  Bond-Angle cross-terms                  9570 
  Angle-Angle cross-terms                 7584 
  Angle-Angle-Torsion cross-terms         6784 
  Non-bond pairs                    ≈1,600,000 


Minimisation and molecular dynamics both require computing the energy using eqn. (1), changing 
the coordinates and repeating this process many times. Note that each energy calculation involves 
evaluating the appropriate terms in eqn. (1) for each of the internals listed in Table I. Thus the last 
three terms in eqn. (1) need to be evaluated for each of the 1,600,000 non-bonded pairs. As the time 
required to compute the change in the coordinates once the energy has been calculated is small, the 
time required to calculate the energy determines the time to perform the minimisation, or how many 
steps of dynamics can be done. For a minimisation the number of iterations depends on how close to 
zero we require the derivatives. For a conjugate gradient minimiser, previous experience indicates that 
a number of iterations about 3 times the number of atoms is required to get derivatives to less than 0.05 
kcal/mol Å, which is about 10,000 iterations for the protein. In molecular dynamics we would like to 
simulate at least 100 picoseconds, preferably a nanosecond, as this is still a very short time compared to 
molecular events such as binding. This requires 100,000 iterations at a 1 femtosecond timestep. Thus 
the speed with which the energy calculation is carried out is crucial. 
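
A rough budget shows why the energy routine dominates. The sketch below takes the iteration counts from this paragraph and the per-iteration timings reported in Table II (573.58 on the VAX 11/780 and 2.7 on the Cyber 205, assumed here to be seconds), and it assumes that one dynamics step costs about the same as one minimisation iteration.

```python
n_atoms = 2995
min_iterations = 3 * n_atoms            # rule of thumb: about 10,000 for the protein

timestep_fs = 1                         # femtoseconds per dynamics step
md_steps = 100 * 1000 // timestep_fs    # 100 picoseconds of simulated time

vax_s_per_iter = 573.58                 # Table II, VAX 11/780 (assumed seconds)
cyber_s_per_iter = 2.7                  # Table II, Cyber 205 (assumed seconds)

vax_md_days = md_steps * vax_s_per_iter / 86400.0
cyber_md_hours = md_steps * cyber_s_per_iter / 3600.0
cyber_min_hours = 10_000 * cyber_s_per_iter / 3600.0

print(min_iterations, md_steps)
print(f"100 ps dynamics: {vax_md_days:.0f} days on the VAX, "
      f"{cyber_md_hours:.0f} hours on the Cyber 205")
print(f"10,000 minimisation iterations on the Cyber 205: {cyber_min_hours:.1f} hours")
```

On these assumptions a 100 picosecond run moves from the order of two years to about three days of machine time.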

Non-bond interaction calculation. Table II shows the timings of the energy routines used to 
compute eqn. (1) on the VAX 11/780 and the Cyber 205 for the Dihydrofolate Reductase system. The 
non-bond part of the calculation takes by far the major portion of the CPU time, 78% of the iteration 
time on the VAX, so this was vectorised first. The routine computes the non-bond energy, see eqn. 
(1), by calculating the interaction between all pairs of atoms, except for bonded atoms and 1-3 
interactions. For a 10 Å cutoff this is ≈1.6x10^6 pairs, which is the reason this is the major time consuming 
portion of the energy calculation. 

Table II 

Comparison of the Timing of Energy Calculation routines for 1 Iteration 

Routine               VAX 11/780    CYBER (Vectorised, Large Pages) 
Bonds                       2.42    0.055 
Valence Angles              9.06    0.13 
Torsion Angles             30.69    0.55 
Bond-Bond                   5.25    0.14 
Bond-Angle                 11.9     0.25 
Angle-Angle                16.55    0.17 
Out of Plane                2.35    0.10 (1) 
Non-Bond                  448.98    1.23 

Iteration Timing (2)      573.58    2.7 

1. The out of plane routine is not vectorised. 

2. The iteration timing is slightly larger than the sum of all the individual routine timings as it 
includes the time for the minimisation routine itself. 

This was implemented on the VAX by a residue neighbour list in 
which for each residue a list of all the residues it interacts with is stored. This neighbour list is set up 
prior to the non-bond calculation and has to be recalculated every so often if a cutoff is used. In the 
non-bond calculation a loop is performed over all the residues and for each residue the interactions of 
all atoms in it with all atoms of the residues in the neighbour list of this residue are computed. This 
routine was vectorised by calculating the interaction of 1 atom with all its neighbouring atoms as vector 
operations. This gives vector lengths of up to 1000 for a 10 Å cutoff. A bit vector with the length of 
the number of atoms in the molecule is set up for each atom which indicates whether an atom interacts 
with this atom or not. This is a large array, N^2/2, where N is the number of atoms, but because of the 
bit addressing capability of the Cyber 205 this only takes up 70,000 words in memory. The perfor- 
mance improvement of this routine after vectorisation is 365 over the VAX, which includes the intrin- 
sic scalar speed of the Cyber 205, some 14 times faster than the VAX. The vectorisation of the non- 
bond routine took approximately 1 month. 
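
The per-atom bit vector scheme can be illustrated with a small Python sketch. It is only a sketch
under stated assumptions: the function names are hypothetical, Python integers stand in for the
Cyber 205 bit vectors, 64-bit words are assumed for the storage estimate, and the 12-6 pair term is a
placeholder rather than the non-bond terms of eqn. (1).

```python
import math

def words_for_pair_mask(n_atoms, bits_per_word=64):
    """Words needed to hold one bit per unordered atom pair (about N^2/2 bits)."""
    return math.ceil(n_atoms * n_atoms / 2 / bits_per_word)

def build_masks(coords, excluded, cutoff):
    """For each atom i, an integer bit mask of the atoms j > i it interacts with."""
    n = len(coords)
    masks = []
    for i in range(n):
        mask = 0
        for j in range(i + 1, n):
            if (i, j) in excluded:
                continue                      # bonded or 1-3 pair: skipped
            if math.dist(coords[i], coords[j]) <= cutoff:
                mask |= 1 << j
        masks.append(mask)
    return masks

def pair_energy(r):
    """Placeholder 12-6 pair term in reduced units (an assumption, not eqn. (1))."""
    return r ** -12 - r ** -6

def nonbond_energy(coords, masks):
    """Sum pair terms, handling all neighbours of one atom as a single batch."""
    total = 0.0
    for i, mask in enumerate(masks):
        # Gather the neighbours selected by atom i's bit mask ...
        neighbours = [j for j in range(len(coords)) if (mask >> j) & 1]
        # ... then treat the whole neighbour set as one "vector" operation.
        distances = [math.dist(coords[i], coords[j]) for j in neighbours]
        total += sum(pair_energy(r) for r in distances)
    return total

# Four atoms on a line; consecutive atoms are bonded and therefore excluded.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.5, 0.0, 0.0)]
excluded = {(0, 1), (1, 2), (2, 3)}
masks = build_masks(coords, excluded, cutoff=3.0)
print(masks, nonbond_energy(coords, masks), words_for_pair_mask(2995))
```

With N = 2995 and 64-bit words the storage estimate comes out close to the 70,000 words quoted in
the text, which is the point of packing the interaction flags one bit per pair.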

Valence energy calculation. The valence energy and cross-term routines take ≈20% of the iteration
time on the VAX. These routines were vectorised next, starting with the torsion angle routine, which is
the next most time-consuming routine at 6% of the iteration time on the VAX. The bond, valence
angle and torsion angle routines already used a list of the internals in the VAX version. These were all
vectorised by creating vectors for the bonds, valence angles and torsion angles, which gives vector
lengths from 3000 to 9000 for the dihydrofolate reductase system, see Table I. These vectorisations
resulted in performance improvements by factors of 37 to 90 over the VAX in these routines.

To date we have achieved a net gain in speed over the VAX 11/780 of a factor of 212 for the enzyme
simulation study described above.
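
The speedup factors can be checked against the Table II timings. The Python sketch below simply
divides the two columns; it assumes the timings are CPU seconds (the table does not state a unit)
and uses the 10,000-iteration minimisation mentioned earlier to project a total run time.

```python
# Timings for one iteration, copied from Table II (units assumed to be CPU seconds).
TIMINGS = {
    "Bonds":          (2.42, 0.055),
    "Valence Angles": (9.06, 0.13),
    "Torsion Angles": (30.69, 0.55),
    "Bond-Bond":      (5.25, 0.14),
    "Bond-Angle":     (11.9, 0.25),
    "Angle-Angle":    (16.55, 0.17),
    "Out of Plane":   (2.35, 0.10),
    "Non-Bond":       (448.98, 1.23),
}
ITERATION = (573.58, 2.7)   # whole iteration, including the minimiser itself

def speedup(vax, cyber):
    """Ratio of VAX 11/780 time to Cyber 205 time."""
    return vax / cyber

for name, (vax, cyber) in TIMINGS.items():
    print(f"{name:15s} {speedup(vax, cyber):6.1f}")
print(f"{'Iteration':15s} {speedup(*ITERATION):6.1f}")

# Projected time for a 10,000-iteration minimisation on each machine.
ITERATIONS = 10_000
vax_days = ITERATIONS * ITERATION[0] / 86_400
cyber_hours = ITERATIONS * ITERATION[1] / 3_600
print(f"VAX: {vax_days:.1f} days, Cyber 205: {cyber_hours:.1f} hours")
```

Under these assumptions the non-bond and whole-iteration ratios round to the factors of 365 and 212
quoted in the text, and the projected minimisation drops from roughly two months to a working day.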

ABSTRACT 


Simulation of circuits having more than 2000 active devices requires the 
largest, fastest computers available. A vector computer, such as the CYBER 205, 
can yield great speed and cost advantages if efforts are made to adapt the simu- 
lation program to the strengths of the computer. 

ASPEC and SPICE (1) are two widely used circuit simulation programs. 
ASPECV and VAMOS (5) are respectively vector adaptations of these two simu- 
lators. They demonstrate the substantial performance enhancements possible for 
this class of algorithm on the CYBER 205. ASPECV is in use at ISD. VAMOS is in 
daily production use at MOSTEK. 


INTRODUCTION 


Over the past decade, the design of integrated circuits has become increas- 
ingly complex. Manufacturers who once had special purpose circuits of only a few 
dozen components now have microprocessors and random access memory chips 
constructed of thousands of devices. While early circuits were readily designed 
and debugged by hand, the more complex circuits have necessitated computer 
assistance. 

During one phase of computer aided design, circuit simulation programs are 
used. These programs are given circuit interconnection information (nodes) and 
device characterizations (models). After establishing initial current and voltage 
conditions at time zero, they simulate circuit operation by evaluating device con- 
ductances and node voltages over small increments of time. Due to the rapid 
response of microcircuitry to voltage changes, circuit simulation must often be 
performed at timesteps of a few hundred picoseconds. This small timestep may 
necessitate thousands of steps to simulate circuit performance for a given set of 
initial inputs. Many such simulations (which may each require hours on an IBM 
3081 or CDC 176) are required to thoroughly explore a circuit's characteristics 
over a wide range of temperatures and input sets. 
The speed of a supercomputer is valuable to engineers designing such very
large scale integrated (VLSI) circuits. These engineers are, however, unwilling to
compromise simulation accuracy for speed. For this reason, various projects have
investigated vector computers (2) (3) (4) for use in the transient analysis of VLSI
circuits.

Two well-known and widely used circuit simulators are ASPEC, copyrighted 
by Mr. Frank Jenkins, and SPICE, copyrighted by the Regents of the University of 
California. ASPECV is the product of a technical team from the San Francisco 
District of Control Data Corporation Professional Services Division. This team 
spent approximately one man-year analyzing ASPEC in detail. Their effort 
included extensive conversations with the program's author and the rewriting of 
select areas of code for enhanced performance. 

The program VAMOS was developed by Steven D. Hamm and Steven R. 
Beckerich of MOSTEK Corporation. VAMOS evolved from a simple installation of 
SPICE2 into a program in which 80 percent of the analysis routine code is 
vectorized. Many sections of code were radically changed due to the application 
of algorithmic, rather than simple syntactic, vectorization. 


ARCHITECTURAL CONSIDERATIONS 


ASPEC and SPICE were initially developed for a type of computer similar
to the Control Data Corporation 6400. Originally, the programs were designed to
handle circuits with fewer than 600 devices. Intentional minimization of memory
requirements increased central processor time. Many users modified ASPEC and
SPICE for use with large-scale circuits, extending the programs into areas far
beyond their design. When any design is so overextended, there are often
undesirable consequences. One obvious consequence was long running time on
circuits with more than 2,000 devices.

Optimum performance for both ASPEC and SPICE required retailoring program
design to fit the architecture of the CYBER 205. The CYBER 205 used has
two vector pipes, a 16 megabyte memory, and is capable of 200 million floating
point operations per second (Megaflops) on 64 bit operands. To maximize
performance, the characteristics of this hardware must be considered. Some major
considerations are:

1. The CYBER 205 defines a vector as contiguous memory locations. While
ASPEC has a compatible memory organization, the linked list storage of SPICE2
needs reorganization.

2. The scalar functional units on the CYBER 205 are pipelined. Code that cannot
be vectorized can be optimized by taking advantage of inherent parallelism. Even
so, the performance of scalar code will probably be substantially less than the
theoretical maximum of 50 Megaflops.

3. The hardware can generate and use bit vectors, which are useful in vectorizing
loops containing conditional statements. These bit vectors aid in producing
routines that have no scalar code and run at full vector speed.

4. The virtual memory of the CYBER 205 provides over 2 trillion words of user
memory space. Any program that repetitively uses more than the entire physical
memory may, however, generate a great amount of paging delay. This fact
constrains the choice of algorithms, as a fast algorithm may require additional
memory.


PROGRAM DESIGN 


Both ASPEC and SPICE perform their simulations by alternating modeling
routines with a current matrix solution routine. The modeling routines calculate
the new device conductances based on device operating points. There is one
model for each type of device, such as diodes, jfets, mosfets, and bipolar
transistors. One model must simulate many different operating modes and
consequently has many branches and special cases.

The matrix solution routine calculates branch currents based on the con- 
ductances calculated by the modeling routines. From these currents new node 
voltages are obtained. This routine uses sparse Gaussian Elimination techniques. 
The time required by this routine grows very rapidly and non-linearly with circuit 
complexity. 

In SPICE, to best utilize the long vector capabilities of the CYBER 205, 
an interface routine was written between the vectorized analysis routines and the 
rest of SPICE2. This routine reorganized memory into contiguous vectors and 
established new element pointers. ASPEC was similarly treated. The task was 
less formidable as data was already in homogeneous arrays. 
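
The reorganization of linked list storage into contiguous vectors can be sketched as follows. This
is a minimal Python illustration, not the VAMOS interface routine: the DeviceNode class, the
to_contiguous function and the parameter layout are all hypothetical stand-ins chosen for the
example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviceNode:
    """One element of a linked list of devices (a stand-in for SPICE2 storage)."""
    kind: str                            # e.g. "diode" or "mosfet"
    params: tuple                        # model parameters for this device
    next: Optional["DeviceNode"] = None

def to_contiguous(head):
    """Walk the list once and regroup devices into per-kind contiguous columns."""
    columns = {}     # kind -> one list ("vector") per model parameter
    pointers = {}    # kind -> original list positions, in the new order
    position = 0
    node = head
    while node is not None:
        cols = columns.setdefault(node.kind, [[] for _ in node.params])
        for col, value in zip(cols, node.params):
            col.append(value)
        pointers.setdefault(node.kind, []).append(position)
        position += 1
        node = node.next
    return columns, pointers

# Build a short mixed list: diode -> mosfet -> diode.
tail = DeviceNode("diode", (6.0, 7.0))
middle = DeviceNode("mosfet", (3.0, 4.0, 5.0), tail)
head = DeviceNode("diode", (1.0, 2.0), middle)
columns, pointers = to_contiguous(head)
print(columns["diode"], pointers["diode"])
```

After the regrouping, every parameter of every diode sits in one contiguous list, so a model routine
can process all diodes with long vector operations, while the pointer lists record where each result
belongs in the original ordering.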

In both VAMOS and ASPECV, vectorization of device equations is done by 
long vector operations with conditional stores for the results. All devices are 
evaluated in all regions of operation and the results are masked together to form 
composite result vectors. This technique avoids the data motion overhead charac- 
teristic of other methods at a cost of extra operations in each region. For 
VAMOS, the data given in Table 1 shows the tremendous advantage vectorization 
provides. The small amount of scalar store code remaining in MOSFET 
contributes 19.4 of the total 25.5 seconds. 


ROUTINE        SCALAR        VAMOS        RATIO
LOAD             19.9          1.8         11.1
DIODE            79.4          3.6         22.1
MOSFET          325.4         25.5         12.8

Table 1. VAMOS Routine Comparisons (times in seconds)
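
The "evaluate all regions, then mask" technique can be illustrated with a short Python sketch. The
device equations here are a textbook square-law MOSFET model with made-up parameter values, chosen
only for illustration; they are an assumption and not the VAMOS or ASPECV model code.

```python
def drain_current_scalar(vgs, vds, vt=1.0, k=2.0):
    """Branching (scalar) form of a textbook square-law MOSFET model."""
    vov = vgs - vt
    if vov <= 0.0:
        return 0.0                                    # cutoff
    if vds < vov:
        return k * (vov * vds - 0.5 * vds * vds)      # linear region
    return 0.5 * k * vov * vov                        # saturation

def drain_current_masked(vgs, vds, vt=1.0, k=2.0):
    """Evaluate every region for every device, then merge the results by mask."""
    vov = [g - vt for g in vgs]
    # Each of these is one long "vector" operation over all devices.
    i_cut = [0.0 for _ in vov]
    i_lin = [k * (o * d - 0.5 * d * d) for o, d in zip(vov, vds)]
    i_sat = [0.5 * k * o * o for o in vov]
    # Bit vectors (here lists of booleans) marking the region of each device.
    on = [o > 0.0 for o in vov]
    in_linear = [a and d < o for a, o, d in zip(on, vov, vds)]
    # Conditional store: keep the value from the region each device is in.
    return [cut if not is_on else (lin if is_lin else sat)
            for is_on, is_lin, cut, lin, sat
            in zip(on, in_linear, i_cut, i_lin, i_sat)]

vgs = [0.5, 2.0, 2.0, 3.0]
vds = [1.0, 0.5, 2.0, 2.0]
print(drain_current_masked(vgs, vds))
```

Every device pays for all three region formulas, which is the "extra operations in each region"
cost noted in the text, but no device data has to be sorted or moved by region before the
arithmetic is done.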

In VAMOS, the vector startup time required by the CYBER 205 caused the 
rejection of a vectorized matrix solution method for subcircuits as used in the 
program CLASSIE (2). Instead, effort was expended in scalar code optimization to 
achieve maximum instruction overlap. As part of the preprocessing phase of the 
program, the row-column lookup is performed once and the indices are stored in 
an auxiliary array. 

In addition to the VAMOS techniques, ASPECV's routine EQNSOL detects 
perfect alignment between rows in the matrix. As circuit size increases, the 
number of such rows increases dramatically. Full row-length linked triads are 
executed in this case. 


PROGRAM PERFORMANCE 


Table 2 illustrates a comparison between a scalar version and VAMOS. The
scalar version was already heavily optimized. The circuit tested contained 2256
mosfets, 1312 diodes, 1774 resistors and capacitors, and had 1429 equations with
98.9 percent matrix sparsity. Overall, VAMOS ran 3 times faster than the scalar
version, and 4 times faster in transient analysis. VAMOS performed the analysis
over 100 times faster than a VAX-11/780.


ROUTINES          SCALAR        VAMOS
READIN              68.4         51.9
SETUP               34.7         22.7
DC SOLUTION         47.8         19.0
TRANSIENT          503.8        126.4
OUTPUT               5.6          5.6
TOTAL              660.3        225.9

Table 2. VAMOS Program Performance Comparison

Table 3 shows the characteristics of a series of flexible circuits which can
be made any size by repeating a basic circuit block. Resistors and capacitors are
also present but are irrelevant to modeling time. Table 4 gives execution times for
two processors running ASPEC, and for the current version of ASPECV on the CYBER
205. It is projected that, with continued effort, the CYBER 205 mosfet run times
for large circuits could be reduced by another factor of 2 to 3. Table 5 shows that
the time to model a given device decreases with increasing circuit size, a very
desirable characteristic for VLSI circuitry.

CIRCUIT      DIODES      MOSFETS      NODES      MATRIX
   1             50           50         30         119
   2            100          100         54         220
   4            200          200        102         470
   8            400          400        182         860
  16            800          800        358        1718
  32           1600         1600        718        3473

Table 3. Circuit Characteristics




CIRCUIT      TIME STEPS      UNIVAC 1182      CDC 176      CDC 205
   1              420               30             6            3
   2              622               82            16            6
   4              869              208            42           15
   8             1658              697           141           40
  16             1658             1421           301           76
  32             1658          TOO BIG       TOO BIG          158

Table 4. ASPEC/ASPECV Comparison



             AVERAGE TIME (microseconds)      VECTOR
CIRCUIT        diode        mosfet            EFFICIENCY
   1             9.7            39                50
   2             7.1            32                66
   4             5.8            28                80
   8             5.2            26                89
  16             4.7            25                94
  32             4.5            24

Table 5. ASPECV Size/Efficiency
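
The trend in Table 5 is what one would expect if a fixed vector startup cost is amortized over more
devices as the circuit grows. The Python sketch below tests that reading with a simple
startup-plus-streaming model, t(n) = t_stream + t_startup / n. The model and the function name are
our assumption for illustration, not the authors' definition of vector efficiency; the diode counts
are taken from Table 3.

```python
# Average diode model time per device (microseconds) from Table 5, with the
# number of diodes in each test circuit taken from Table 3.
DIODES = [50, 100, 200, 400, 800, 1600]
TIME_US = [9.7, 7.1, 5.8, 5.2, 4.7, 4.5]

def fit_startup_model(n_small, t_small, n_large, t_large):
    """Solve t(n) = t_stream + t_startup / n through two measured points."""
    t_startup = (t_small - t_large) / (1.0 / n_small - 1.0 / n_large)
    t_stream = t_large - t_startup / n_large
    return t_stream, t_startup

# Fit through the smallest and largest circuits only.
t_stream, t_startup = fit_startup_model(DIODES[0], TIME_US[0],
                                        DIODES[-1], TIME_US[-1])

# Compare the model with the four intermediate circuits it was not fitted to.
for n, measured in zip(DIODES, TIME_US):
    predicted = t_stream + t_startup / n
    print(f"{n:5d} devices: measured {measured:4.1f}, model {predicted:4.1f}")
```

If the intermediate circuits land close to the fitted curve, the measured times are consistent with
a per-device streaming cost plus a fixed overhead per pass, which would explain why larger circuits
model each device more cheaply.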


Since most circuit simulation runs produce a great deal of printed output, 
current simulations using ASPECV spend the majority of their time in Fortran 
I/O. As an example, one ASPECV circuit containing 1000 devices and 950 nodes 
initially ran in 980 seconds on a UNIVAC 1182 and in 141 seconds on the CYBER 
205. After optimizing everything but the diode and mosfet models, the same 
circuit required 72 seconds on the 205. Of the 72 seconds, 39 were spent in the 
models. ASPECV requires only 44 seconds to simulate the same circuit. Only 6.3 
seconds are required in the models: 1.3 in diodes, 5.0 in mosfets. Although the 
mosfet model is still several times slower than theoretically possible, further 
effort would yield small returns indeed. The simulation mentioned spends over 66 
percent of its time in Fortran I/O routines. 
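
The claim that further model optimization would yield small returns can be made concrete with
Amdahl's law. The Python sketch below uses only the figures quoted in this paragraph; the function
name is ours.

```python
def amdahl_speedup(fraction_improved, factor):
    """Overall speedup when a fraction of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)

TOTAL_S = 44.0       # ASPECV run time quoted in the text
MODELS_S = 6.3       # time spent in the diode and mosfet models
IO_FRACTION = 0.66   # lower bound on the Fortran I/O share quoted in the text

model_fraction = MODELS_S / TOTAL_S

# Doubling the speed of the models again barely changes the total.
models_twice_as_fast = amdahl_speedup(model_fraction, 2.0)

# Even infinitely fast models cannot remove the remaining 37.7 seconds.
best_from_models = TOTAL_S / (TOTAL_S - MODELS_S)

# If everything except I/O were free, the run could shrink to the I/O time at most.
best_without_io_work = 1.0 / IO_FRACTION

print(f"{models_twice_as_fast:.2f} {best_from_models:.2f} {best_without_io_work:.2f}")
```

On these numbers no amount of work on the models can gain more than about 17 percent, and the whole
program cannot run more than about 1.5 times faster unless the Fortran I/O itself is addressed.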

CONCLUSION 


Program speedups of 3 to 4 were accomplished through vectorization. 
Future work directed at vectorization of the remaining scalar code may result in a 
similar speed increase. Fortran I/O provides an effective limit to maximum 
attainable speed. 

This report details some of the new computational methods and equivalent
mathematical representations of physics models used in the MCV code, a vectorized
continuous-energy Monte Carlo code for use on the CYBER-205 computer. While
the principal application of MCV is the neutronics analysis of repeating reactor
lattices, the new methods used in MCV should be generally useful for
vectorizing Monte Carlo for other applications. For background, a brief overview
of the vector processing features of the CYBER-205 is included, followed by a
discussion of the fundamentals of Monte Carlo vectorization. The physics models
used in the MCV vectorized Monte Carlo code are then summarized. The new
methods used in scattering analysis are presented along with details of
several key, highly specialized computational routines. Finally, speedups
relative to CDC-7600 scalar Monte Carlo are discussed.

Introduction


Monte Carlo calculations fill a special and important need in reactor physics
analysis -- they represent "truth" against which approximate calculational
methods may be calibrated. The Monte Carlo method permits the exact modeling
of problem geometry, a highly accurate mathematical model for neutron
interactions with matter, and a cross section representation that is as accurate as
theory and measurement permit. The precision of Monte Carlo results is
primarily limited by the computing time required to reduce statistical
uncertainties.

Conventional (scalar) Monte Carlo codes simulate the complete history of a
single neutron by repeated tracking through problem geometry and by random
sampling from probability distributions that represent the collision physics.
The accumulation of data for 1,000,000 neutron histories will typically
require three to seven hours of CDC-7600 CPU time. On newer computers such as
the CYBER-205, scalar Monte Carlo codes may run one and one-half to two times
faster (with some tailoring of the coding) because of the reduced cycle time
and improved architecture of the scalar processors. Much larger gains are
possible when the vector processing hardware of the CYBER-205 is utilized.

The random nature of the Monte Carlo method seems to be at odds with the
demands of vector processing, where identical operations must be performed on
streams of contiguous data (vectors). Early known efforts to vectorize Monte
Carlo calculations for other vector computers were either unsuccessful or, at
best, achieved speedups on the order of seven to ten times for highly
simplified problems. Recent results for Monte Carlo in multigroup shielding
applications and in continuous-energy reactor lattice analysis have
demonstrated that Monte Carlo can be successfully vectorized for the CYBER-205
computer. Speedups of twenty to fifty times relative to CDC-7600 scalar
calculations have been achieved without sacrificing the accuracy of standard
Monte Carlo methods. Speedups of this magnitude permit the analysis of
1,000,000 neutron histories in only five to ten minutes of CPU time and thus
make the Monte Carlo method more accessible to reactor analysts.

General Considerations for Vectorized Monte Carlo


Conventional scalar Monte Carlo codes may be characterized as a collection
of random decision points separated by short and simple arithmetic.
Individual neutron histories are simulated, one at a time. The basic idea of
vectorized Monte Carlo is to follow many neutrons simultaneously through their
random walks, using vector instructions to speed up the computation rates.

The many conditional branches (IF...GOTO), few DO-loops, and largely random
data retrieval embodied in conventional Monte Carlo codes preclude
vectorization through the use of automatic vectorizing software or by a syntactic
vectorization of coding. Instead, experience has shown that a comprehensive,
highly integrated approach is required. The major elements of such an
approach are as follows:

1. The entire cross section and geometry database must be restructured to 
provide a unified data layout. 

2. The entire Monte Carlo code must be restructured (rewritten). 

3. Deliberate and careful code development is essential. 

Clever programming and machine "tricks" alone will not ensure successful
vectorization of a Monte Carlo code. The key to successful vectorization of
Monte Carlo is that a well-defined structure must be imposed on both the
database and Monte Carlo algorithm before coding is attempted. This structure may
arise simply from the reorganization of existing data/algorithms or may entail
the development of special mathematics or physics. Careful and systematic
development helps to preserve the structure as the vectorized code becomes
more complex.

Vectorization Techniques

The principal obstacle to vectorizing a conventional scalar Monte Carlo code
is the large number of IF-statements contained in the coding. Examination of
sections of coding shows that, typically, one-third of all essential FORTRAN
statements may be IF-tests. Careful consideration of the Monte Carlo program
logic and underlying physics permits categorizing these IF-statements and
associating them with three general algorithmic features of Monte Carlo codes
-- implicit loops, conditional coding, and optional coding. Implicit loops
are vectorized using shuffling, and conditional coding is vectorized using
selective operations. This approach to vectorizing Monte Carlo is effective
on the CYBER-205 and other vector computers having hardware capabilities for
vectorized data handling. In successful attempts to vectorize Monte Carlo
methods, 40 to 60% of all vector instructions used in actual coding were
vector data handling instructions (gather, compress, bit-controlled operations,
etc.).

The data-handling operations associated with shuffling and selective
operations in the vectorized code constitute extra work that is not necessary in a
scalar code. This extra work offsets some of the gain in speed achieved from
vectorization. For vectorization to be successful, overhead from shuffling
and selective operations should comprise only a small fraction of total
computing time. It is thus essential that all data handling operations be
performed with vector instructions. Vector computers that must rely on scalar
data handling operations are severely limited in vectorized Monte Carlo
performance.
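
The shuffling idea can be illustrated with a toy problem in Python. This sketch is not the MCV
algorithm: the physics is reduced to a single absorption probability per collision, the function
name and parameters are hypothetical, and Python lists stand in for hardware vectors. It shows only
the control structure, in which a bank of neutrons is advanced one event at a time and the survivors
are compressed into a shorter contiguous bank.

```python
import random
from itertools import compress

def run_histories(n_histories, p_absorb, seed=1):
    """Follow a bank of neutrons event by event until every history ends."""
    rng = random.Random(seed)
    collisions = [0] * n_histories           # collision tally per history
    bank = list(range(n_histories))          # indices of neutrons still alive
    while bank:
        # One "vector" of random numbers for the whole bank.
        xi = [rng.random() for _ in bank]
        # Every neutron in the bank undergoes one collision.
        for i in bank:
            collisions[i] += 1
        # Selective operation: a bit vector marks the neutrons that scatter.
        survives = [x >= p_absorb for x in xi]
        # Shuffling: compress the survivors into a shorter contiguous bank.
        bank = list(compress(bank, survives))
    return collisions

tallies = run_histories(20000, 0.25)
print(sum(tallies) / len(tallies))   # mean collisions per history
```

The implicit per-neutron loop of a scalar code ("repeat until absorbed") has become an explicit loop
over events, and the only data-dependent step is the compress, which is exactly the kind of data
handling the text says must run at vector speed.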

Conclusions


Continuous-energy Monte Carlo methods have been vectorized for the CYBER-205
and the speedups are large. Due to the drastic restructuring of the Monte
Carlo coding and data base, the MCV code has been limited to the treatment of
repeating reactor lattice geometry. This restriction has been deliberate,
however, to permit an orderly and careful program of development. There are
no a priori limitations on the methods used in vectorization that would
preclude extension to more general applications. Profound changes in the methods
used for reactor physics analysis are anticipated now that 1,000,000 neutron
histories may be run in only five to ten minutes with the CYBER-205 vectorized
Monte Carlo vs. the three to seven hours that are typical for CDC-7600 scalar
Monte Carlo.

Abstract

A microscopic dynamical treatment of chemical systems comprising both
light particles that require a quantal description and heavy ones that may be
described adequately by classical mechanics has recently been presented
[J. Chem. Phys. 78, 2240 (1983)]. The application of this ''hemiquantal''
method to the specific problem of the vibrational relaxation of a diatomic
molecule embedded in a one-dimensional lattice is presented. The vectorization
of a CYBER 205 algorithm which integrates the 10^3-10^4 simultaneous
''hemiquantal'' differential equations is examined with comments on
optimization. Results of the simulations are briefly discussed.


* 

David Ross Fellow 

I. Introduction


A microscopic dynamical description of a chemical system composed of both
light particles that require a quantal description and heavy ones that may be
described adequately by classical mechanics has been proposed recently [J.
Chem. Phys. 78, 2240 (1983)]. The description consists of a self-consistent
set of ''hemiquantal'' equations (HQE) arrived at by taking a partial classical
limit of Heisenberg's equations of motion for the system. In form, the HQE
appear to consist of Heisenberg's equations for the light particles coupled to
Hamilton's equations for the heavy particles. The coupling is self-consistent
in that there is an instantaneous feedback between the light and heavy
subsystems, with total energy and probability of presence of the quantal
subsystem being conserved.

This paper will focus on the numerical solution of the HQE on the CYBER 205 
for the special case of a diatomic molecule embedded in a cold, one-dimensional 
lattice. In Section II, we detail the model and specific form of the HQE, 
while the CYBER 205 algorithm and steps taken to optimize performance are 
included in Section III. Results of the simulations and some discussion of 
their physical significance are presented in Section IV. 

II. Model and Equations of Motion 

Figure 1 depicts the physical situation, i.e. a single diatomic molecule BC
occupying a substitutional site in an otherwise pure one-dimensional lattice of
atoms A; the end atoms of the lattice are assumed free. So that the normal
modes of the lattice are known analytically, the mass of BC is taken to be
equal to that of A. The heavy, classically behaving degrees of freedom are
considered to be the displacements ($u_i$) of the lattice atoms, including the
center of mass of BC, from their equilibrium positions. The internal vibration
($q$) of BC is treated quantally and, for simplicity, as a harmonic, two-state
system. We assume that only nearest-neighbor atoms interact with one another:
A-A interactions are harmonic; A-B and A-C interactions are approximated by
Morse potentials.

Under these conditions, the HQE take the form

$$
\begin{aligned}
\dot{c}_i(t) &= -\frac{i}{\hbar}\Big[\epsilon_i c_i(t) + \sum_j V_{ij}(\{u_k(t)\})\,c_j(t)\Big] \\
\dot{u}_i(t) &= p_i(t)/m_A \\
\dot{p}_i(t) &= -\frac{\partial U(\{u_\ell(t)\})}{\partial u_i}
               + \sum_{jk} c_j^*(t)\,c_k(t)\,F_{ijk}(\{u_\ell(t)\}) .
\end{aligned}
\tag{1}
$$

Here $c_i$ is the occupation probability amplitude for quantal state $i$; $p_i$ is the
momentum conjugate to $u_i$; $U$ is the harmonic part of the potential, i.e.

$$
U = \frac{k}{2}\Big[\sum_{i=1}^{n-2}(u_{i+1}-u_i)^2 + \sum_{i=n+1}^{N-1}(u_{i+1}-u_i)^2\Big],
\tag{2}
$$

where N is the number of lattice atoms. F is the quantal force
defined by

$$
F_{ijk} = -\frac{\partial V_{jk}}{\partial u_i},
\tag{3}
$$

where

$$
V_{ij} = \langle i|\,V_{AB} + V_{BC}\,|j\rangle ,
\tag{4}
$$

and the Morse potential $V_{AB}$ is explicitly

$$
V_{AB} = D_{AB}\big\{\exp[-a_{AB}(u_n - u_{n-1} + \text{[illegible]})] - 1\big\}^2 ,
\tag{5}
$$

with a similar expression for $V_{BC}$.

Since the $c_i$ are complex, the HQE consist of 2N+4 coupled first-order
ordinary differential equations. Given initial conditions appropriate to the
physical situation, we can integrate these numerically by standard techniques.
Our principal problem now is to develop and optimize an algorithm appropriate
to the CYBER 205.


III. CYBER 205 Algorithm 


The HQE [Eqs. (1)] can be cast in terms of the vector differential equation
$\dot{X} = f(X(t))$, defined by

$$
\begin{aligned}
\dot{x}_1(t) &= f_1(x_1, \ldots, x_n), & x_1(0) &= x_1^0, \\
             &\;\;\vdots \\
\dot{x}_n(t) &= f_n(x_1, \ldots, x_n), & x_n(0) &= x_n^0 .
\end{aligned}
\tag{6}
$$

The vector X can be written as X = [C,U,P] where, for example,

$$
C = [C1, C2, C3, C4] .
\tag{7}
$$

From experience, we have found the HQE extremely well-behaved. Therefore, they
can be handled with a relatively simple differential equation solver. We
employ the familiar fourth-order Runge-Kutta algorithm (RK4) which, for our
case, is summarized by the following equations:

$$
\begin{aligned}
K_1 &= T f(X) \\
K_2 &= T f(X + K_1/2) \\
K_3 &= T f(X + K_2/2) \\
K_4 &= T f(X + K_3) \\
X[(n+1)T] &= X(nT) + (K_1 + K_4)/6 + (K_2 + K_3)/3
\end{aligned}
\tag{8}
$$

where T is an appropriately chosen time step and X on the right-hand sides
stands for X(nT). Our choice of RK4 is guided by several considerations: it is
quite stable, self-starting and easily coded for the CYBER 205. In addition, we
need no direct method of estimating truncation error since we can calculate
total energy and probability of the system as a check. Eventually, the RK4
algorithm will be used to calculate input values for a more sophisticated
predictor-corrector routine.
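
The RK4 step of Eqs. (8) and the idea of using a conserved quantity as the accuracy check can be
sketched in a few lines of Python. The test system here is a unit harmonic oscillator chosen purely
for illustration; it is not the HQE, and the function names are ours.

```python
import math

def rk4_step(f, x, tau):
    """Advance dX/dt = f(X) by one step of size tau using Eqs. (8)."""
    k1 = [tau * v for v in f(x)]
    k2 = [tau * v for v in f([xi + 0.5 * ki for xi, ki in zip(x, k1)])]
    k3 = [tau * v for v in f([xi + 0.5 * ki for xi, ki in zip(x, k2)])]
    k4 = [tau * v for v in f([xi + ki for xi, ki in zip(x, k3)])]
    return [xi + (a + d) / 6.0 + (b + c) / 3.0
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def oscillator(x):
    """Illustrative system: unit harmonic oscillator, x = [position, momentum]."""
    return [x[1], -x[0]]

def energy(x):
    """Conserved quantity used as the accuracy check."""
    return 0.5 * (x[0] ** 2 + x[1] ** 2)

state = [1.0, 0.0]
e0 = energy(state)
for _ in range(1000):                     # integrate to t = 10 with tau = 0.01
    state = rk4_step(oscillator, state, 0.01)
drift = abs(energy(state) - e0)
error = abs(state[0] - math.cos(10.0))    # exact position is cos(t)
print(drift, error)
```

Each K is one pass over the whole state vector, which is why the method maps naturally onto long
vector operations; monitoring the drift in a conserved quantity plays the role that a truncation
error estimate would otherwise have.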

Since our simulations require widely varying amounts of memory, we would 
like to assign storage at execution time. Clearly, the vector pipelines are 
used more efficiently if the entire derivative vector is manipulated at once. 
If we are to deal almost exclusively on the dynamic stack, we need a method of 
parsing the vector X into subvectors C,U,P which can then be handled 
independently. This ''breaking up'' is accomplished by building descriptors 
using SHIFT and OR operations on an integer equivalenced to a descriptor which 
points to an area in dynamic space. The subroutine BREAKUP is presented in the 
Appendix. This routine allows the RK4 mainline to allocate storage dynamically 
while permitting the derivative routine to access each subvector individually. 
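
The effect of BREAKUP, a single state vector that can also be addressed as separate subvectors
without copying, can be imitated in Python with memoryview slices. This is only an analogy: CYBER 205
descriptors are a hardware feature, and the function below, its name and its arguments are
hypothetical. The layout X = [C, U, P] with four amplitude components follows Eq. (7).

```python
from array import array

def breakup(x, n_atoms, n_amplitudes=4):
    """Return views (no copies) of the C, U and P parts of the state vector."""
    view = memoryview(x)
    c = view[:n_amplitudes]
    u = view[n_amplitudes:n_amplitudes + n_atoms]
    p = view[n_amplitudes + n_atoms:n_amplitudes + 2 * n_atoms]
    return c, u, p

n_atoms = 3
x = array("d", [0.0] * (2 * n_atoms + 4))    # storage sized at run time
c, u, p = breakup(x, n_atoms)
c[0] = 1.0      # writes through to x[0]
u[1] = 0.25     # writes through to x[5]
print(list(x))
```

The integrator can keep operating on x as one long vector while the derivative routine works on c,
u and p individually, and a change made through any view is immediately visible in the full vector.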

We now concentrate on the vector function subprogram that calculates the
derivative f(X). In our model, the four probability amplitudes must be
accessed individually each time the function is called. Rather than waste a
vector instruction to store the subvector C in a temporary array, it is faster
and more convenient to use the following sequence of hardware calls to load
them directly into registers:

      ASSIGN TEMP, C
      CALL Q8LOD(TEMP,,C1)
      CALL Q8IX(TEMP,64)
      CALL Q8LOD(TEMP,,C2), etc.

The constants needed to calculate the potential and force functions are
computed in advance and passed via labeled common. By reviewing an assembly
listing of the program, one can minimize the number of loads necessary to
access these constants. The evaluation of $\dot{U}$ is easily done by a vector
multiplication with a stored reciprocal mass.

$\dot{P}$ can be conveniently calculated by evaluating the derivative of a fully
harmonic potential $U'$. Thus we have

$$
-\frac{\partial}{\partial u_i} U'(\{u_j\}) = k(-2u_i + u_{i-1} + u_{i+1}),
\quad \text{where } u_0 = u_1,\; u_{N+1} = u_N ,
\tag{9}
$$

which can be effected by two vector additions and two vector multiplications as
follows:

      UTEMP(1;N) = K*(-2.*UTEMP(1;N) + UTEMP(0;N) + UTEMP(2;N))

where UTEMP is a temporary array set to the current values of U. Finally, $\dot{P}$ is
obtained by replacing the n-1, n, and n+1 elements of UTEMP by the proper
values reflecting the Morse potentials at the diatomic. To accomplish this, it
is necessary to access the five displacements $\{u_i,\ i = n-2, \ldots, n+2\}$.
Alternatively, descriptors could be built to define the necessary vectors on U and the
values stored in UTEMP. In this case, hardware calls would be required to set
the first and last elements of UTEMP, to access the five elements of U around
$u_n$, and to store values in the three middle positions.
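
The three-point expression of Eq. (9) amounts to combining three shifted copies of the same padded
vector. The Python sketch below shows that structure with list slices; the function name is ours,
and the slices merely play the role of the offset vectors UTEMP(0;N), UTEMP(1;N) and UTEMP(2;N).

```python
def harmonic_force(u, k):
    """Force -dU'/du_i = k(-2 u_i + u_{i-1} + u_{i+1}) with free ends, Eq. (9)."""
    n = len(u)
    # Pad with ghost elements u_0 = u_1 and u_{N+1} = u_N (1-based, as in the text).
    utemp = [u[0]] + list(u) + [u[-1]]
    left = utemp[0:n]          # u_{i-1}, like UTEMP(0;N)
    centre = utemp[1:n + 1]    # u_i,     like UTEMP(1;N)
    right = utemp[2:n + 2]     # u_{i+1}, like UTEMP(2;N)
    return [k * (-2.0 * c + a + b) for c, a, b in zip(centre, left, right)]

forces = harmonic_force([0.0, 1.0, 3.0], k=2.0)
print(forces, sum(forces))
```

Because the ghost elements copy the end displacements, the end atoms feel only their single
neighbor, and the forces sum to zero as they must for a free chain with purely internal
interactions.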

The conservation of total energy and probability gives us two necessary
criteria to check the accuracy of the numerical solution. The total energy is
given by

$$
\begin{aligned}
E = {} & U(\{u_i\}) + P \cdot P/(2m_A) + |c_0|^2\epsilon_0 + |c_1|^2\epsilon_1 \\
       & + |c_0|^2 V_{00} + 2\,\mathrm{Re}\{c_0^* c_1\}V_{10} + |c_1|^2 V_{11} ,
\end{aligned}
\tag{10}
$$

while total probability is simply

$$
P = |c_0|^2 + |c_1|^2 ,
\tag{11}
$$

which must remain unity. These checks were made every 1000 iterations using
values calculated in the first pass through the derivative routine. To
calculate U' [Eq. (9)], the following code is used:

      ASSIGN TEMP, .DYN. N-1
      TEMP = Q8VDELT(U;TEMP)
      EU = (K/2) * Q8SDOT(TEMP,TEMP)
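
In Python terms, the two hardware calls correspond to forming the N-1 adjacent differences of U and
taking their dot product with themselves. The sketch below assumes that reading of Q8VDELT and
Q8SDOT; the function name is ours, and the sign convention of the differences does not matter since
they are squared.

```python
def harmonic_energy(u, k):
    """U' = (k/2) * sum_i (u_{i+1} - u_i)^2 via differences and a dot product."""
    delta = [b - a for a, b in zip(u, u[1:])]       # N-1 adjacent differences
    return 0.5 * k * sum(d * d for d in delta)      # dot product of delta with itself

eu = harmonic_energy([0.0, 1.0, 3.0], k=2.0)
print(eu)
```

Both steps are single passes over contiguous data, so the energy check costs only two vector
operations per evaluation.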

In Table I, sample iteration times and estimates of floating point
operations per second are given. The timings are for loops without I/O or
accuracy checks. The results of several simulations are presented in the next
Section.

IV. Results of Simulations 

Our simulations all take the diatomic to be in its excited state and the
lattice to be at 0 K initially. This means that all elements of X(0) are zero,
except the real component of $c_1(0)$, which is unity. The time step size is
$0.01\,\omega^{-1}$, where $\omega$ is the transition frequency of the diatomic. The
quantity of principal interest here is $|c_1|^2$, the probability of the diatomic being
excited. The physical constants for the system, which are chosen roughly to
mimic HCl in Ar, are listed in Table 2. The only variable quantities are $\omega$ and
N. The transition frequency is chosen low in order to observe relaxation on
the time-scale of the simulation.

Figure 1 displays plots of $|c_1|^2$ versus time for a sampling of simulations.
Frames (a)-(c) demonstrate the effect of increasing the diatomic's transition
frequency $\omega$ (cm$^{-1}$), holding the number of lattice atoms fixed. It appears that
the rate of loss of energy from the diatomic increases with increasing
frequency up to a point. In fact, frame (c) suggests that the diatomic evolves
to a metastable state in which it loses no further energy. To test this
hypothesis, we increased the number of lattice atoms to N = 2000. The result,
shown in frame (f), bears this notion out. For purposes of comparison, we
include a simulation for a smaller lattice (N = 200). Here we see the effect
of a pulse, which bounces back and forth, interfering with the monotonic
relaxation of the diatomic.

V. Conclusion 

These simulations represent the first application of a new description of
the dynamics of chemical processes. Most previous approaches employ long-time
asymptotic approximations, in which the coupling between the subsystems is weak
and the decay is therefore very slow on the time scale of molecular motions
($10^{-14}$ s). The advancement of ultrafast laser spectroscopy now allows chemists
to monitor directly fast relaxation processes ($10^{-12}$ s). In this regime, the
coupling is more significant, and accurately solving the equations of motion
becomes crucial. The HQE can be used for this purpose. However, any practical
implementation will require a vector processor, such as the CYBER 205. Our
calculations would be essentially impossible on Purdue University's
6500/6500/6600 system, for example. The calculations would take 50-100 times
longer, even if the storage for the vectors were available.

The main feature of our CYBER 205 algorithm is a mainline that assigns 
storage at execution time. The vector function subprogram that evaluates the 
derivative can access the subvectors individually while the mainline processes 
the entire vector. This is accomplished by building the appropriate 
descriptors using the BREAKUP subroutine (see Appendix) . 

Some preliminary results were presented in Section IV. Future research 
will deal with the actual mechanism of energy exchange between the two 
subsystems. Also planned are some N-state models with applications in surface
chemistry.
Table I. Increase of calculation speed with increase
of number of equations

   Equations    Iteration Time    MegaFLOPS

        24         .157 ms           6.1
       204         .204 ms          22.8
       804         .256 ms          37.9
      2004         .671 ms          69.3
      4004         1.19 ms          77.7
     10004         2.75 ms          83.8
     20003         5.37 ms          85.6
Table II. Parameters of model system


= 9.25 x 10 15 ergs = 

a AB = 1 * 83 1 1()8 Cm_1 a AC = 

2 

k = 814 ergs /cm m^ = 

m fi - 1.67 x 10 -24 g “c 
Appendix

      SUBROUTINE BREAKUP(X, NSUB, LENSUB, DESSUB, NDIM)
      IMPLICIT INTEGER (A-Z)
C
C     BREAKUP- TAKES A DESCRIPTOR (X) AND MANUFACTURES OTHER
C              DESCRIPTORS [DESSUB(N)] THAT POINT TO SUBVECTORS OF
C              LENGTHS LENSUB(N) WHICH COMPRISE THE VECTOR POINTED
C              TO BY X
C
C     ARGUMENTS:
C       X-      DESCRIPTOR TO BE 'BROKEN UP'
C       NSUB-   NUMBER OF SUBVECTORS
C       LENSUB- ARRAY CONTAINING THE SUBVECTOR LENGTHS
C       DESSUB- ARRAY CONTAINING THE RESULTING DESCRIPTORS
C       NDIM-   DIMENSION OF LENSUB AND DESSUB
C
      DESCRIPTOR D, X, DESSUB(NDIM)
      DIMENSION LENSUB(NDIM)
      EQUIVALENCE (D, DTEMP)
      ASSIGN D, X
      ADD= SHIFT(SHIFT(DTEMP,16),-16)
      DO 100 N=1,NSUB
         LENGTH= SHIFT(LENSUB(N),48)
         DTEMP= OR(ADD,LENGTH)
         ASSIGN DESSUB(N),D
         ADD= ADD + 64*LENSUB(N)
  100 CONTINUE
      RETURN
      END
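The bit manipulation in BREAKUP may be easier to follow in a language with explicit integer operations. The Python sketch below is an illustration under stated assumptions, not a description of the hardware: it assumes, from the shifts of 16 and 48 in the listing, that a descriptor is a 64-bit word holding a 16-bit length in its upper bits and a 48-bit bit address below it, and that each element occupies 64 bits. The function names and the length check are hypothetical additions.

```python
LENGTH_BITS = 16                  # assumed width of the length field (upper bits)
ADDRESS_BITS = 48                 # assumed width of the bit-address field
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1
WORD_BITS = 64                    # assumed size of one vector element in bits


def make_descriptor(bit_address, length):
    """Pack a length and a bit address into one 64-bit descriptor word."""
    if not 0 <= length < (1 << LENGTH_BITS):
        raise ValueError("length does not fit in the length field")
    return (length << ADDRESS_BITS) | (bit_address & ADDRESS_MASK)


def split_descriptor(descriptor):
    """Return (bit_address, length) of a descriptor word."""
    return descriptor & ADDRESS_MASK, descriptor >> ADDRESS_BITS


def breakup(descriptor, sub_lengths):
    """Build descriptors for consecutive subvectors of the parent vector."""
    address, total = split_descriptor(descriptor)
    if sum(sub_lengths) > total:   # check added for the sketch only
        raise ValueError("subvectors exceed the parent vector")
    result = []
    for n in sub_lengths:
        result.append(make_descriptor(address, n))
        address += WORD_BITS * n   # step past the n elements just described
    return result


if __name__ == "__main__":
    parent = make_descriptor(bit_address=6400, length=10)
    for d in breakup(parent, [4, 6]):
        print(split_descriptor(d))
```

Because the subvector descriptors point into the parent's storage, no data are copied: the derivative routine and the mainline see the same memory through different descriptors.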
CHEMICAL APPLICATION OF DIFFUSION QUANTUM MONTE CARLO*

Peter J. Reynolds and William A. Lester, Jr.+
University of California
Berkeley, California 94720

The diffusion quantum Monte Carlo (QMC) method gives a stochastic
solution to the Schrödinger equation. This approach has recently been
receiving increasing attention in chemical applications as a result of
its high accuracy. However, reducing statistical uncertainty remains a
priority because chemical effects are often obtained as small differences
of large numbers. We give as an example the singlet-triplet splitting of
the energy of the methylene molecule CH$_2$.

We have implemented the QMC algorithm on the Cyber 205, first as a
direct transcription of the algorithm running on our VAX 11/780, and
second by explicitly writing vector code for all loops longer than a
crossover length C*. We discuss the speed of the codes relative to one
another as a function of C*, and relative to the VAX. Since CH$_2$ has
only eight electrons, most of the loops in this application are fairly
short. The longest inner loops run over the set of atomic basis
functions. We discuss the CPU time dependence obtained versus the number
of basis functions, and compare this with that obtained from traditional
quantum chemistry codes and that obtained from traditional computer
architectures. Finally, we discuss some preliminary work on restructuring
the algorithm to compute the separate Monte Carlo realizations in
parallel, potentially allowing vectors of unlimited length.
1. BACKGROUND


In recent years Monte Carlo methods have been increasingly
applied to quantum-mechanical problems. Quantum Monte Carlo (QMC)
methods fall into two major categories. Variational QMC$^{1,2}$ is a
method of evaluating expectation values of physical quantities with a
given (generally optimized) trial wave function $\Psi_T$. The procedure
in effect amounts to evaluating a ratio of two integrals, although
the actual Monte Carlo procedure is generally more sophisticated.

The second major category of QMC is the "exact" type. In these
latter approaches the Schrödinger equation is actually "solved". It
is not necessary to already have a highly accurate wave function in
order to compute the expectation values. Properties of interest are
in effect "measured" as the system evolves under the Schrödinger
equation. When a stationary state is obtained, averages of the
measured quantities give the desired expectation values.

Only recently have chemical calculations by exact QMC methods
been carried out.$^{3,4}$ We will discuss here one such QMC method,
the fixed-node diffusion QMC, which we have been using in calculating
molecular energies. In Section 2 we present the basic
theory. Section 3 describes the algorithm. The implementation of
this algorithm on the Cyber 205, its optimization, and results are
discussed in Section 4.
2. BASIC THEORY

The Schrödinger equation may be rewritten in imaginary time,
and with a constant shift in the zero of energy, in the following form:

$$\frac{\partial \Psi(R,t)}{\partial t} = [D\nabla^2 - V(R) + E_T]\,\Psi(R,t). \qquad (1)$$

Here $D = \hbar^2/2m_e$, R is the three-N dimensional coordinate vector
of the N electrons, and V(R) is the potential energy (the Coulomb
potential for a molecular system). Equation (1) is simply a
diffusion equation combined with a first-order rate process, and thus
may be readily simulated. The function $\Psi(R,t)$ plays the role of the
density of diffusing particles. These particles undergo branching
(exponential birth or death processes) according to the rate term
$[E_T - V(R)]\,\Psi(R)$. Thus, the number of diffusers increases or
decreases at a given point in proportion to the density of diffusers
already there.

The steady-state solution to Eq. (1) is the ground-state
eigenfunction of the Schrödinger equation. Furthermore, the value of
$E_T$ at which the population of diffusers is asymptotically constant
gives the energy eigenvalue $E_0$. The lowest eigenstate, however, is
that of a Bose system. In order to treat a Fermi system, such as a
molecule, we need to impose anti-symmetry on $\Psi(R)$. A method which
does this, and at the same time allows us to sample more efficiently
(to reduce our statistical error), is importance sampling with an
anti-symmetrized importance function $\Psi_T$. The zeros (nodes) of $\Psi_T$
become absorbing boundaries for the diffusion process, maintaining
the anti-symmetry. A simple form for $\Psi_T$ which gives the necessary
anti-symmetry is a Slater determinant of molecular orbitals
multiplied by a symmetric function of the coordinates.

To implement importance sampling, one simply multiplies Eq. (1)
by $\Psi_T$ and rewrites it in terms of a new probability density f(R,t)
given by

$$f(R,t) = \Psi_T(R)\,\Psi(R,t). \qquad (2)$$

The resultant equation for f can be written as

$$\frac{\partial f}{\partial t} = D\nabla^2 f + [E_T - E_L(R)]\,f - D\nabla\cdot[f\,F_Q(R)]. \qquad (3)$$

The local energy $E_L(R)$ and the "quantum force" $F_Q(R)$ are simple
functions of $\Psi_T(R)$. Eq. (3), like Eq. (1), is a generalized
diffusion equation, now with the addition of a drift term, due to the
effect of $F_Q$. It is Eq. (3) that we solve stochastically. Using a
Green's function approach, our diffusers are made to follow a "random
walk" (Markov chain) in such a way that their asymptotic distribution
is given by the steady-state solution, $f_\infty(R)$, of Eq. (3). Properties
of interest (such as the energy) are measured during the "walks", and
are thus averages over the distribution $f_\infty(R)$.
3. ALGORITHM


We present here an outline of the algorithm for performing
diffusion QMC. For more detail see Ref. 4. This algorithm is not
structured specifically for the architecture of the Cyber 205. We
will return to this point in the next section.

(0) Initialization. First generate an ensemble of $N_c$
configurations of the N-electron system. Typically $N_c \approx 100$-500.
These coordinates may be chosen randomly, or more efficiently from
the distribution $|\Psi_T(R)|^2$. This initial distribution is
f(R, t=0).

(1) Loop over blocks . In each block: 

(2) Repeatedly loop over the ensemble until the time in each 
configuration has reached the chosen target time. For each 
member of the ensemble compute the inverse of the Slater 
matrix. Then: 

(3) Loop over the electrons. Compute $F_Q$ for the current
electron. Move to

$$r' = r + D\tau F_Q + \chi \qquad (4)$$

where $\tau$ is the discrete time-step size, and $\chi$ is a
3-dimensional Gaussian random variable with a mean of zero
and a variance of $2D\tau$. This corresponds to the diffusive
motion. If the electron crosses a node, eliminate the
configuration from the ensemble and continue loop (2) over
the ensemble. Otherwise update the Slater matrix and its
inverse, and continue loop (3).

After all electrons in the current configuration have been
moved, advance the time associated with this new configuration
R' by $\tau$. Calculate $E_L(R')$. Also calculate the branching
factor, or multiplicity,

$$M = \exp(-\tau\{[E_L(R) + E_L(R')]/2 - E_T\}). \qquad (5)$$

Return M copies of this configuration to the ensemble. This
branching, or birth and death process, corresponds to the rate
term in Eq. (3). Weight all averages by M. Continue loop (2).
After all members of the ensemble have reached the target time, the
current block is finished. Use $\langle E_L \rangle$ to update $E_T$. Store $\langle E_L \rangle$
and the other averages. "Renormalize" the ensemble back to its
original size $N_c$. (This is necessary because the population grows
or shrinks exponentially. Although we have endeavored to make the
exponent close to zero [cf. Eq. (5)], asymptotically at large time the
population will either vanish or overflow the allocated storage.)
Reset all averages to zero. Continue loop (1) for the desired number
of blocks.

(4) Average over blocks . 
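The outline above can be exercised on a toy problem. The sketch below is a deliberately reduced illustration, not the production code: it assumes a single particle in a one-dimensional harmonic well (units with D = 1/2), a hypothetical Gaussian trial function $\Psi_T = \exp(-Ax^2/2)$ with A = 0.8, and renormalization by random resampling. The toy ground state has no nodes, so the node-crossing test and the Slater-matrix updates of steps (2) and (3) do not arise.

```python
import math
import random

A = 0.8      # width of the toy trial function exp(-A x^2 / 2) (assumed)
D = 0.5      # diffusion constant hbar^2 / 2m in units hbar = m = 1
TAU = 0.01   # time-step size


def local_energy(x):
    """E_L = (H psi_T) / psi_T for H = -(1/2) d^2/dx^2 + x^2 / 2."""
    return 0.5 * A + 0.5 * (1.0 - A * A) * x * x


def quantum_force(x):
    """F_Q = 2 psi_T' / psi_T for the Gaussian trial function."""
    return -2.0 * A * x


def run_dmc(n_walkers=200, n_blocks=20, steps_per_block=100, n_discard=5, seed=1):
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * D * TAU)
    # (0) Initialization: sample |psi_T|^2, a Gaussian of variance 1/(2A).
    walkers = [rng.gauss(0.0, math.sqrt(0.5 / A)) for _ in range(n_walkers)]
    e_trial = sum(local_energy(x) for x in walkers) / n_walkers
    block_energies = []
    for block in range(n_blocks):                     # loop (1) over blocks
        e_sum = 0.0
        w_sum = 0.0
        for _ in range(steps_per_block):              # loop (2) over the ensemble
            new_walkers = []
            for x in walkers:
                # Eq. (4): drift plus Gaussian diffusion.
                x_new = x + D * TAU * quantum_force(x) + rng.gauss(0.0, sigma)
                e_old = local_energy(x)
                e_new = local_energy(x_new)
                # Eq. (5): branching factor.
                m = math.exp(-TAU * (0.5 * (e_old + e_new) - e_trial))
                e_sum += m * e_new                    # averages weighted by M
                w_sum += m
                copies = int(m + rng.random())        # M copies on average
                new_walkers.extend([x_new] * copies)
            walkers = new_walkers
        e_block = e_sum / w_sum
        e_trial = e_block                             # use <E_L> to update E_T
        walkers = rng.choices(walkers, k=n_walkers)   # renormalize to N_c
        if block >= n_discard:
            block_energies.append(e_block)
    return sum(block_energies) / len(block_energies)  # (4) average over blocks


if __name__ == "__main__":
    print("estimated ground-state energy:", run_dmc())
```

The exact ground-state energy of this toy Hamiltonian is 0.5, which gives a simple check on the estimate. Note that the innermost work here is per walker, which is the loop the text later proposes to vectorize.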
4. CYBER 205 IMPLEMENTATION


The problem we chose to study is the singlet-triplet energy
splitting of the methylene molecule, CH$_2$. CH$_2$ is fairly typical
of the molecules we have been studying by QMC, in terms of the number 
of electrons and the number of nuclei. As a result, most of the 
inner loops in this application are quite short. The longest inner 
loop runs over the set of atomic basis functions. With this in mind, 
we present our results on the relative performance of the Cyber 205 
and the VAX 11/780. To compare with the CDC 7600, we note that our 
code runs almost exactly ten times faster on the 7600 than on the VAX. 

We have implemented the QMC algorithm on the Cyber 205, initially 
by simply transcribing our working program from the VAX to the Cyber. 
The major impediment at this stage was the lack of unformatted I/O on
the Cyber and, even worse, its inability to handle logical records
longer than 137 bytes. After rewriting these portions of the code,
the program finally ran.

With automatic vectorization both on and off, the Cyber ran
approximately 16 times the speed of the VAX. Apparently, any
speed-up from vectorization of the longer loops was lost to the
start-up time for vectorizing the short loops. It seemed clear that
explicit vectorization was required. Thus, as our next step, all
long inner loops of constant length were written explicitly in vector
syntax, while short constant-length loops were left as DO loops.
Most loops in the code, however, are of variable length. These were
all recoded in the form:


IF (length .GT. C*) THEN 


[Vector code] 


ELSE 


[Scalar code] 


END IF. 
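The reason a crossover length exists at all can be seen from a simple cost model. The Python sketch below is illustrative only: the start-up cost and per-element costs are hypothetical numbers in arbitrary units, chosen so that the break-even point falls near the value of 16 reported below, and the function names are not from the original code.

```python
def scalar_time(n, per_element=4.0):
    """Cost of a scalar loop of length n (hypothetical units)."""
    return per_element * n


def vector_time(n, startup=50.0, per_element=1.0):
    """Cost of a vector operation of length n, including fixed start-up."""
    return startup + per_element * n


def crossover_length(startup=50.0, scalar_per_element=4.0, vector_per_element=1.0):
    """Smallest loop length at which the vector path is strictly faster."""
    if vector_per_element >= scalar_per_element:
        raise ValueError("vector path never wins with these costs")
    n = 1
    while vector_time(n, startup, vector_per_element) >= scalar_time(n, scalar_per_element):
        n += 1
    return n


def total_cost(loop_lengths, c_star):
    """Cost of a set of loops dispatched as IF (length .GT. C*) vector ELSE scalar."""
    return sum(vector_time(n) if n > c_star else scalar_time(n) for n in loop_lengths)


if __name__ == "__main__":
    lengths = [8, 20, 50]
    print("break-even length:", crossover_length())
    for c_star in (0, 16, 100):
        print("C* =", c_star, "cost =", total_cost(lengths, c_star))
```

In this model a very small C* pays start-up on short loops, while a very large C* forgoes the vector speed-up on long ones, which is the qualitative shape described for Figure 1.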


We present in Figure 1 our performance results as a function of
the crossover length C*. At values of C* greater than 26 the scalar
section of code is always being executed, and thus the curve flattens
out. For C* less than approximately 16, it appears that vector
start-up time hinders performance. The optimum crossover point
appears to be around 16. The lowest of the three curves corresponds
to the implementation described above. Subroutine calls are quite
costly on the Cyber 205. Thus in the middle curve we show the result
of removing two short subroutines (both written in IF-THEN-ELSE
form) and substituting vector code directly into the calling
programs. The speed-up is fairly dramatic, providing a peak speed of
close to 20 times the VAX (up from 17).

Interestingly, although the compiler recognizes that A**2 should
be replaced by A*A, inside of vector code A**2 calls the
float-to-an-integer-power routine. Needless to say, this is costly. Essentially,
changing one line of vector code from A**2 to A*A led to the improvement
shown in the top curve. Clearly the improvement is most
pronounced for small C*, where this line of code is being executed
more frequently.

As mentioned earlier, the longest inner loop is over the number
of atomic basis set functions, n. Traditional quantum chemistry
codes scale as $n^4$ or $n^5$. Thus increasing the size of the basis
set can be very costly. In our QMC approach, the algorithmic
dependence on n is linear. In Fig. 2 we plot the relative run times
as a function of basis set size on both the VAX (upper curve) and the
Cyber 205 (lower curve). Both curves are indeed fairly linear in n.
However, the slope for the Cyber is almost flat. This smaller slope
is due to an increase in the vector length rather than an increase in
the number of machine instructions being executed. The result is a
speed enhancement of 30 over the VAX (up from 20) by n=50.

Although a factor of 30 over the VAX (or equivalently a factor
of 3 over the 7600) is certainly good, it is nowhere near our hoped-for
performance. This can be explained by the fact that even loops
of length 50 are relatively short on the Cyber 205. Possibly more
important, however, is that the relatively long inner loops constitute
only a fraction of the code being executed. Thus, truly high speed
for this kind of application requires an architectural rewrite of the
code.


Looking over the algorithm (cf. Sect. 3) it is clear that the
entire structure is highly parallel. This is a fairly general
characteristic of Monte Carlo codes. Thus, on a parallel processor
the loop (1) over blocks can be eliminated, and each block can be
computed independently on a separate processor. There is no communication
required between processors until the very end, when [step (4)]
the average over blocks is computed.

For a truly efficient Cyber 205 algorithm, however, loop (1) is
too short to vectorize, generally ranging between 10 and 100. Loop
(2) is much more desirable to vectorize, with $N_c \approx 100$-500. To do
so, this loop must be made innermost in the new algorithm. In other
words, the entire ensemble must be treated in parallel. Furthermore,
the vector length is dynamic, since at each time-step the birth and
death process modifies the ensemble size. We are currently developing
this fully vector code for future implementation. This code
appears to have great potential for fully exploiting the vector
capabilities of the 205.
Finally, in Table 1, we present our results on the singlet-triplet
energy splitting of methylene, and compare these results with theory 
and experiment. 
TABLE 1.

The ground-state ($^3B_1$) and first-excited state ($^1A_1$) energies of methylene.

   Method            $^3B_1$ energy (hartrees)    $^1A_1$ energy (hartrees)

   Hartree-Fock          -38.9343                   -38.8944
   CI-SD                 -39.1071                   -39.0956
   CI-SDQ (est.)         -39.122                    -39.105
   QMC                   -39.128 ± 0.004            -39.108 ± 0.004
   Experimental          -39.148                    ----

   Method            $^1A_1$ - $^3B_1$ energy (kcal/mole)

   CI                    9.9-11.3
   Expt                  8.5-19.6
   QMC                   12.3 ± 3.4
Figure 1. Relative speeds of the Cyber 205 and the VAX 11/780 for
quantum Monte Carlo calculations of the ground-state energy of CH$_2$.
The crossover point C* is the vector length below which variable-length
loops are run in scalar mode. Thus, for large C* these loops
are all run in scalar mode, whereas for very small C*, vector start-up
time hinders performance. The three curves correspond to different
degrees of hand-optimization of the code. See text for details.
Note that the curves interpolating the data points are simply polynomial
fits to the data. The actual curve for a particular molecule
is a set of steps at the values of the various loop lengths that
occur in the problem. The fits can be considered an "average"
behavior for this type of calculation.
[Figure 2 plot: "CPU TIME vs VECTOR LENGTH"; vertical axis: time (arbitrary units); horizontal axis: basis set size; curve label: CYBER 205.]

Figure 2. CPU time versus the number of atomic basis set functions,
n. Conventional codes scale as $n^\lambda$ with $\lambda \approx 4$-6 while QMC scales
simply as n. Both the VAX and Cyber show this n dependence clearly.
However, the slope for the Cyber is almost zero. At n=16 the Cyber
is 20 times the speed of the VAX while at n=50 the Cyber is 30
times faster.
Abstract

New methods are introduced for improving the performance of the
vectorized Monte Carlo SU(3) lattice gauge theory algorithm using the
CDC CYBER 205. Structure, algorithm and programming considerations are discussed.
The performance achieved for a $16^4$ lattice on a 2-pipe system may be phrased
in terms of the link update time or overall MFLOPS rates. For 32-bit arithmetic
it is 36.3 μsec/link for 8 hits per iteration (40.9 μsec for 10 hits) or
101.5 MFLOPS.
1. Introduction


Many important results for quantum field theories in general and, in 
particular, for the gauge theory of strong interactions known as Quantum 
Chromodynamics (QCD) have been obtained by formulating the dynamics on a 
space-time lattice. The lattice version of a quantized gauge field theory, 
as proposed by Wilson [1], has the properties of introducing an ultraviolet 
cut-off independently of any perturbative expansion and of preserving 
manifest gauge invariance. It permits a variety of investigations by 
non-perturbative techniques, strong-coupling expansions [2] and Monte
Carlo (MC) simulations [3] being the most notable ones. Monte Carlo 
simulations, indeed, have probably produced the most important results 
for QCD, being able to probe the structure of the theory in the domain 
where the transition between the strong-coupling behavior at large distances 
and the asymptotically-free behavior at small separation takes place. 

Numerical methods must be used to explore the very crucial domain
of intermediate couplings, since there are no known analytical techniques
for solving or even efficiently approximating gauge theories throughout
that region. On the other hand, the fact that quantum fluctuations on
a finite lattice extending for n sites in each of four dimensions are given
by integrals of dimensionality $4n^4 n_g$ ($n_g$ is the number of independent
parameters in group space), which can easily exceed 2,000,000, leaves
importance sampling, i.e. Monte Carlo simulations, as the only calculational
possibility.

Monte Carlo calculations are of a numerical nature, and quite 
demanding on computational resources. The simulation of a system with 
SU(3) gauge group (i.e. the system underlying QCD) on a lattice extending
for n sites in each of the four space-time dimensions requires storage
of $4n^4$ link variables, i.e. $4n^4$ SU(3) matrices, and the systematic
replacement, or "upgrading", of each of these matrices with new, updated
values, for several hundred or several thousand sweeps of the whole
lattice. One MC iteration is defined as a sweep of the lattice, i.e.,
one upgrade per link variable. A computation involving M MC iterations
thus implies $4Mn^4$ individual upgrades of SU(3) matrices. The upgrading
of each SU(3) matrix requires approximately 4,150 elementary arithmetic
operations and 180 table look-ups (if 10 attempts at changing the link
variable are made for each upgrade). For a lattice large enough for
obtaining physically meaningful results, the amount of computation needed
for a Monte Carlo simulation of QCD becomes extremely high.

Because of the aforementioned difficulties, Monte Carlo simulations
of QCD have been generally limited to lattices of rather small extent,
a lattice of $8^4$ sites already representing a large lattice with respect
to the scale of most calculations. On the other hand, with the progress
in the field, it has become apparent that one must definitely analyze
larger systems to develop confidence in the numerical results. This need
may be understood on physical grounds. If 2 GeV is considered as a
universal energy for the effects of asymptotic freedom to begin manifesting
themselves, one would like the lattice spacing to be smaller than $(2\,\mathrm{GeV})^{-1}$
(and the corresponding ultraviolet cut-off larger than 2 GeV), i.e. smaller
than 0.1 fm. Conversely, if the goal of the computations is to determine
hadronic structure, the extent of the lattice should be larger than the
typical size of a hadron. Taking this size to be (minimally) 1 fm,
it becomes apparent that the parameter n ought to be larger, if possible
substantially larger, than 10. With, e.g., n = 16 and M = 1000 the
calculation of a MC simulation requires more than $10^{12}$ operations, not a
small task even for the largest machines currently available.

The number of data elements involved, and the amount of
computation needed for manipulating these data, make it worthwhile
to investigate ways of vectorizing the code.

The purpose of this article is to illustrate the vectorization and 
implementation on the CDC CYBER 205 of a code for Monte Carlo simulations 
of the SU(3) lattice gauge theory. (For previous implementations of 
vectorized code see Ref. 4.) As will be discussed in more detail in 
the final section of this paper, the characteristics and performance are 
such that 1 MC iteration of a $16^4$ lattice can be done in 10.72 seconds
(corresponding to an upgrade time of 40.9 μsec per SU(3) link variable).
Thus, $16^4$ and larger lattices can be considered for meaningful
simulations of QCD. While we describe in this article the program for 
the basic Monte Carlo algorithm, we are currently using it, together 
with other vectorized codes, for a reevaluation on a large lattice, of 
several quantities of theoretical and phenomenological interest in QCD. 
The results of these investigations will be presented separately [5]. 

Here we proceed with a description of the computational algorithm and 
an outline of its vectorization in Sect. 2, with a more detailed 
account of the program in Sect. 3 and a summary of performance data in 
Sect. 4. 
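The figures quoted in this introduction can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text (n = 16, M = 1000, about 4,150 arithmetic operations per upgrade, 10.72 seconds per iteration); the variable names are illustrative, and the MFLOPS estimate assumes that only the arithmetic operations, not the table look-ups, are counted.

```python
N_SITES = 16             # lattice extent n in each of the four dimensions
SWEEPS = 1000            # number of MC iterations M
OPS_PER_UPGRADE = 4150   # arithmetic operations per link upgrade (10 hits)
SWEEP_SECONDS = 10.72    # quoted time for one MC iteration

links = 4 * N_SITES ** 4                          # 4 n^4 link variables
total_ops = links * SWEEPS * OPS_PER_UPGRADE      # operations for M sweeps
upgrade_usec = SWEEP_SECONDS / links * 1.0e6      # time per link upgrade
mflops = OPS_PER_UPGRADE / upgrade_usec           # operations per microsecond

if __name__ == "__main__":
    print("link variables        :", links)
    print("operations, M sweeps  : %.3e" % total_ops)
    print("upgrade time (usec)   : %.1f" % upgrade_usec)
    print("estimated MFLOPS      : %.1f" % mflops)
```

The results agree with the "more than $10^{12}$ operations" estimate, the 40.9 μsec upgrade time and the 101.5 MFLOPS rate given in the abstract.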

2. The Monte Carlo Algorithm 

We consider a hypercubical lattice of $n_s$ sites in each of the
three spatial directions and $n_t$ sites in the temporal one. The
dynamical variables of the SU(3) gauge theory are 3x3 unitary-unimodular
complex matrices, which are associated with the $4n_s^3 n_t$ links
of the lattice. We denote by $U_x^\mu$ the matrix associated with the
oriented link coming out of the lattice site of (integer) coordinates
$x = (x_1, x_2, x_3, x_4)$ in the direction $\mu$ ($\mu = 1,2,3,4$). The goal of the
Monte Carlo algorithm is to produce a stochastic sequence of configurations
of the system $C^{(i)}$ (a configuration being defined as the collection
of all $U_x^\mu$), such that the probability P(C) of encountering any
configuration C in the sequence approaches, after a reasonable
equilibration time, the distribution

$$P(C) \propto \exp\{-S(C)\}, \qquad (2.1)$$

where S is the action of the configuration C in that sequence. S
is given by a sum over plaquette variables $U_p$, a plaquette being an
oriented square of the lattice defined by the origin x and two directions
$\mu$ and $\nu$:

$$S = \beta \sum_p \left(1 - \tfrac{1}{3}\,\mathrm{Re}\,\mathrm{Tr}\,U_p\right), \qquad (2.2)$$

where

$$U_p = U_x^{\nu}\, U_{x+\hat\nu}^{\mu}\, U_{x+\hat\mu}^{\nu\dagger}\, U_x^{\mu\dagger}. \qquad (2.3)$$

$\beta$ is the coupling parameter and $\hat\mu$, $\hat\nu$ stand for unit lattice
vectors in the $\mu$ and $\nu$ directions, respectively. When Eqn. 2.1 is satisfied,
quantum mechanical expectation values of observables $\mathcal{O}$, defined rigorously
as averages over all possible configurations, namely

$$\langle \mathcal{O} \rangle = Z^{-1} \int \Big(\prod_{x,\mu} dU_x^{\mu}\Big)\, \mathcal{O}(U)\, \exp[-S(U)] \qquad (2.4)$$

with

$$Z = \int \Big(\prod_{x,\mu} dU_x^{\mu}\Big) \exp[-S(U)], \qquad (2.5)$$

can be approximated by averages taken over the configurations generated
by the Monte Carlo algorithm:

$$\langle \mathcal{O} \rangle \approx \frac{1}{N} \sum_{i=N_0+1}^{N_0+N} \mathcal{O}(C^{(i)}). \qquad (2.6)$$

$N_0$ represents the number of initial configurations discarded in order
to allow for the stochastic sequence to reach equilibrium.

In our code we implement the MC algorithm following the method of
Metropolis et al. [6]. Each individual dynamical variable $U_x^\mu$ is
replaced by a new one according to the following procedure:

i) a new candidate matrix $U_x^{\mu\prime}$ is obtained from $U_x^\mu$ by group multiplication:

$$U_x^{\mu\prime} = R_k U_x^{\mu},$$

where $R_k$ is an SU(3) matrix randomly selected from a prepared set
$\{R_1, \ldots, R_M\}$ of M matrices, to be discussed later;

ii) the change in action $\Delta S$ induced by the variation $U_x^\mu \to U_x^{\mu\prime}$
is calculated:

$$\Delta S = S(U_x^{\mu\prime}, \ldots) - S(U_x^{\mu}, \ldots); \qquad (2.7)$$

iii) a pseudorandom number r with uniform distribution between 0
and 1 is generated and

$$U_x^{\mu} \to U_x^{\mu\prime} \quad \text{if } r < \exp(-\Delta S),$$
$$U_x^{\mu} \to U_x^{\mu} \quad \text{otherwise.}$$

The steps i) to iii) define what is called a "hit" on one of the link
variables. These steps are repeated (number of hits) times. This
completes the upgrading of one (link) variable. One MC iteration
(or one sweep of the lattice) is executed when all the variables have
been processed in this manner.
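The hit procedure i) to iii) can be sketched compactly if the SU(3) matrices are replaced by a simpler group. The Python example below is such a stand-in, not the code described in this paper: it assumes U(1) phases (unit complex numbers) in place of SU(3) matrices, so the trace is the number itself and the change in action is taken as $-\beta\,\mathrm{Re}[F(U' - U)]$ for a given complex "force" F summarizing the neighbouring links. The table construction, which stores each element together with its inverse, and all function names are assumptions of the sketch.

```python
import cmath
import math
import random


def make_table(size, spread, rng):
    """Prepared set of group elements R_k; each is stored with its inverse."""
    table = []
    for _ in range(size // 2):
        r = cmath.exp(1j * rng.uniform(-spread, spread))
        table.extend([r, r.conjugate()])
    return table


def upgrade_link(u, force, beta, table, n_hits, rng):
    """Apply n_hits Metropolis hits to one link, reusing the same force."""
    accepted = 0
    for _ in range(n_hits):
        candidate = rng.choice(table) * u                    # step i)
        delta_s = -beta * (force * (candidate - u)).real     # step ii)
        if delta_s <= 0 or rng.random() < math.exp(-delta_s):  # step iii)
            u = candidate
            accepted += 1
    return u, accepted


if __name__ == "__main__":
    rng = random.Random(0)
    table = make_table(size=10, spread=0.5, rng=rng)
    link, n_accepted = upgrade_link(1 + 0j, force=2.0 + 0j, beta=6.0,
                                    table=table, n_hits=10, rng=rng)
    print("new link:", link, "accepted hits:", n_accepted)
```

The point to notice is that the force is computed once and reused for every hit, so the cost of gathering neighbouring links is amortized over the whole upgrade.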

A crucial consideration for the whole algorithm and also for its
vectorization is that the calculation of the variation of the action
$\Delta S$ involves only a few of the dynamical variables apart from $U_x^\mu$
itself, namely those defined on the remaining links of the six
plaquettes which share the link between x and $x+\hat\mu$. It is
convenient to be slightly detailed at this point and to introduce some
terminology. Given the link from x to $x+\hat\mu$ there are three
"forward" plaquettes incident on it, namely those with vertices

$x$, $x+\hat\mu$, $x+\hat\mu+\hat\nu$ and $x+\hat\nu$,

and three "backward" plaquettes, namely those with vertices

$x$, $x+\hat\mu$, $x+\hat\mu-\hat\nu$ and $x-\hat\nu$,

with $\nu$ taking the three values $\nu \neq \mu$ (see Fig. 1).

We shall define as the "force" acting on $U_x^\mu$ the sum of the expressions

$$F_{f,x}^{\mu\nu} = U_{x+\hat\mu}^{\nu}\, U_{x+\hat\nu}^{\mu\dagger}\, U_x^{\nu\dagger} \qquad (2.8)$$

(corresponding to the forward plaquettes) and

$$F_{b,x}^{\mu\nu} = U_{x+\hat\mu-\hat\nu}^{\nu\dagger}\, U_{x-\hat\nu}^{\mu\dagger}\, U_{x-\hat\nu}^{\nu} \qquad (2.9)$$

(corresponding to the backward plaquettes) over the three values of
$\nu \neq \mu$:

$$F_x^{\mu} = \sum_{\nu \neq \mu} \left(F_{f,x}^{\mu\nu} + F_{b,x}^{\mu\nu}\right). \qquad (2.10)$$

One can easily convince oneself that of the terms contributing to the
action in Eqn. 2.2 all those containing $U_x^\mu$ can be written in the form

$$\beta\left[1 - \tfrac{1}{3}\,\mathrm{Re}\,\mathrm{Tr}\left(F_x^{\mu} U_x^{\mu}\right)\right] \qquad (2.11)$$

and therefore

$$\Delta S = -\frac{\beta}{3}\,\mathrm{Re}\,\mathrm{Tr}\left[F_x^{\mu}\left(U_x^{\mu\prime} - U_x^{\mu}\right)\right]. \qquad (2.12)$$

Thus, we become aware of two fundamental facts:

i) once the force $F_x^\mu$ is calculated, the subsequent hits on
the link variable $U_x^\mu$ can be done without any further recourse to the
values of other U variables.

ii) several upgradings can be done in parallel, provided only that the
forces $F_x^\mu$ required for the computation do not involve any of the $U_x^\mu$
variables that are simultaneously upgraded.

While point i) is relevant for any MC simulation, point ii) acquires
particular importance if one wants to write a vectorized code. Indeed,
as we shall show, all variables $U_x^\mu$ with fixed $\mu$ can be separated into
two sets such that the forces for one set only involve elements of the other.
Then, all the variables belonging to one set can be grouped together
in an array and upgraded simultaneously. Finally one proceeds to upgrade
the elements of the other set (the red-black or checkerboard algorithm
[4]). We will see in the next section that the ability to separate
the link variables into two independent sets is a key to efficient
vectorization.
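One way to see that such a separation is possible is to colour the sites by the parity of their coordinate sum. The Python sketch below checks this for a small periodic lattice; the parity colouring, the periodic boundary conditions and the function names are assumptions made for illustration, and the partition actually used in the code is the one described in the next section. The check relies only on the observation that the links of direction $\mu$ entering the force on the $\mu$-link at x sit at the sites $x \pm \hat\nu$ with $\nu \neq \mu$.

```python
from itertools import product


def colour(site):
    """Checkerboard colour of a site: parity of the sum of its coordinates."""
    return sum(site) % 2


def shifted(site, direction, step, extent):
    """Neighbouring site one step along `direction` on a periodic lattice."""
    moved = list(site)
    moved[direction] = (moved[direction] + step) % extent
    return tuple(moved)


def same_direction_sites_in_force(site, mu, extent):
    """Sites whose mu-links enter the force on the mu-link at `site`."""
    return [shifted(site, nu, step, extent)
            for nu in range(4) if nu != mu
            for step in (+1, -1)]


def sets_are_independent(extent):
    """True if no force ever involves a same-direction link of the same colour."""
    for site in product(range(extent), repeat=4):
        for mu in range(4):
            for other in same_direction_sites_in_force(site, mu, extent):
                if colour(other) == colour(site):
                    return False
    return True


if __name__ == "__main__":
    for extent in (3, 4):
        print("extent", extent, "independent:", sets_are_independent(extent))
```

With periodic boundaries the colouring works only for an even number of sites per direction, since wrapping around an odd lattice joins two sites of the same colour.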

3. The Vectorized Implementation of the Algorithm 

The previous discussion has demonstrated that Monte Carlo lattice 
gauge theories are worthy candidates for vector processing. Until recently, 
however, people were doubtful as to whether the vector capabilities of current
supercomputers could be effectively utilized for such applications. The
main source for this skepticism is the inherent conflict between random 
access to data, an integral part of a Monte Carlo process, and the strict 
order of data elements required for pipelined computations. In other words, 
unless data can be "gathered" at rates comparable to computation rates no 
efficient vectorization can be achieved. 

One of the major strengths of the CDC CYBER 205, and what makes it
a particularly powerful Monte Carlo machine, is the ability to order a
random collection of data by means of a vector instruction, namely, the
"Gather" instruction. This instruction is equivalent to a series of
random, or indirect, "load" operations on a serial computer. The
Gather instruction uses a vector of integers as an "index-list"
pointing to the elements to be fetched. These elements are stored
in the order they have been encountered into an output vector. The
result rate for the Gather operation is one element every 1.25
cycles (a cycle, or clock-period, on the CDC CYBER 205 is 20 nanoseconds).

For a comparison, note that the floating-point arithmetic rate, excluding
division, is one element every cycle per pipe for 64-bit operands. The
CYBER 205 hardware also supports 32-bit operations with twice the
result rate for vector floating-point operations. For example, on a
two-pipe machine 32-bit arithmetic is performed at a rate of 5 nsec
per result, or 200 MFLOPS.
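The semantics of a gather, and the rates just quoted, can be restated in a few lines. The Python sketch below is only an illustration: it uses 0-based indices, whereas a FORTRAN index-list would be 1-based, and the function names are hypothetical. The timing functions simply re-derive the figures in the text from the 20 nsec cycle.

```python
CYCLE_NS = 20.0   # clock period quoted in the text


def gather(source, index_list):
    """Collect source[i] for each i in index_list, in index-list order."""
    return [source[i] for i in index_list]


def gather_ns_per_element():
    """One gathered element every 1.25 cycles."""
    return 1.25 * CYCLE_NS


def arithmetic_ns_per_result(pipes, bits):
    """One result per cycle per pipe for 64-bit operands, two for 32-bit."""
    results_per_cycle = pipes * (2 if bits == 32 else 1)
    return CYCLE_NS / results_per_cycle


def mflops(ns_per_result):
    """Convert nanoseconds per result to millions of results per second."""
    return 1000.0 / ns_per_result


if __name__ == "__main__":
    print(gather([10, 20, 30, 40], [3, 0, 0, 2]))
    print("gather, ns per element        :", gather_ns_per_element())
    print("2 pipes, 32-bit, ns per result:", arithmetic_ns_per_result(2, 32))
    print("2 pipes, 32-bit, MFLOPS       :", mflops(arithmetic_ns_per_result(2, 32)))
```

On these figures a gather delivers one element per 25 nsec against 5 nsec per arithmetic result on a two-pipe machine in 32-bit mode, so gathered data must be reused over several operations for the arithmetic pipes to stay busy.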

The effective utilization of the computational tools built into
the vector processor is closely related to the data structure, as are
most of the important algorithmic decisions. It is, therefore,
appropriate at this point to discuss the memory requirements. A
3x3 complex matrix is represented by 18 real numbers. The constraints
of being unitary and unimodular reduce the number of independent
parameters to 8, but such a minimal representation of the variables
implies a substantial increase in the computational complexity. To obtain
optimal performance it is useful to keep all the 18 values representing
the real and imaginary parts of the elements of U_μ. For a lattice
with n_s = n_t = 16 a configuration will be defined by 18 x 4 x 16^4 =
4.718592 million values, which may be more than can be put in the fast
memory of many computer systems. Fortunately, the sequential nature of
the MC algorithm suggests that only a fraction of the variables need
to be in memory at any one time. The others can be kept on disk.

The factors which determine an optimal size for the partition between 
variables in memory and on disk are the following: 

i) the partition should not make the code unnecessarily complicated;

ii) the I/O operations should not take longer than the actual computations;

iii) sufficiently long vectors should be available.

On the basis of the above requirements we decided to upgrade one
space at a time, i.e. to upgrade all the 4 n_s^3 variables with fixed
time coordinate x_4, and then to proceed to the next x_4, etc. We
shall refer to this procedure as time-slicing and to the collection of
variables with fixed time coordinate as one time-slice of the system.

If the variables with a given x_4 = t are being upgraded, the
calculation of the force requires knowledge of the U_μ with x_4 = t-1,
x_4 = t and x_4 = t+1. Thus 3 time slices need to be in memory
throughout this stage of the calculation. As a matter of fact, since
I/O operations can proceed independently from CPU operations, it
is possible to achieve concurrency of I/O and CPU operations if
extra memory buffer space is allocated for holding the x_4 = t-2 slice
(to be written out), and the x_4 = t+2 slice (to be read in). The
conventional way of implementing concurrent I/O is to allocate space
for two more slices. The resulting five slices in memory act as a
circular buffer as shown in Fig. 2. However, the virtual memory hardware
on the CDC CYBER 205, and the supporting software, provide the capability
to swap data between disk and memory. Hence, the memory area of only one
slice is needed to write out the x_4 = t-2 slice and read in
the x_4 = t+2 slice. Consequently, the total memory requirements for
the link variables are 4 x n_s^3 x 4 x 18 locations. Allowing for some
additional work-space we find that lattices with n_s = 16 can be
considered in a machine with 2M words (16 Mbytes) in full precision
(64-bit words) and n_s ≈ 20 in half precision (32-bit words). The length
in time does not constitute a problem any longer and lattices with any n_t
may be simulated.
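The storage arithmetic of this section can be checked with a few lines. This is only a sketch: the function names are hypothetical, the count of four resident slices (three working slices plus one swap buffer) is my reading of the text, and "2M words" is taken to mean 2 x 1024 x 1024 words.

```python
DIRECTIONS = 4        # link directions per lattice site
REALS_PER_LINK = 18   # real and imaginary parts of a 3x3 complex matrix

def full_lattice_values(n_s, n_t):
    """Number of real values defining a complete configuration."""
    return REALS_PER_LINK * DIRECTIONS * n_s**3 * n_t

def resident_values(n_s, slices=4):
    """Values kept in memory: three working time slices plus one swap buffer."""
    return slices * n_s**3 * DIRECTIONS * REALS_PER_LINK

def words_64bit(values, bits_per_value):
    """Convert a count of values into 64-bit memory words."""
    return values * bits_per_value // 64

MEMORY_WORDS = 2 * 1024 * 1024   # assumed meaning of "2M words"

full_16 = full_lattice_values(16, 16)
resident_16_full = words_64bit(resident_values(16), 64)
resident_20_half = words_64bit(resident_values(20), 32)
```

Under these assumptions the resident slices occupy a little over half of memory in both cases, leaving the remainder for work-space, and the requirement does not depend on n_t.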

With the slicing mechanism in place we now turn to vectorization
aspects of the code. In Sec. 2, the Red-Black ordering was introduced.
The motivation for this choice merits some discussion. The computation
involves, mainly, matrix multiplications. This operation is easily
vectorized, but the matrices concerned are 3x3 matrices, and the resulting
vectors would be only 3 elements long. Efficiently vectorized code
requires longer vectors. This follows from the observation
that the timing formula for a vector instruction may be written as

(Start-up + a·N) cycles (3.1)

where the start-up time is a constant, independent of the vector length.
It amounts to aligning the input and output streams, filling up the
pipelines up to the point where the first result is available and storing
the last result. The start-up time is also independent of the number of
pipelines and of whether 64-bit or 32-bit arithmetic is performed. On the
CDC CYBER 205 it amounts to about 50 cycles, or 1 μsec. The "a·N" term
is known as the "stream time". N is the number of elements in the vector,
so that the stream time is proportional to the vector length; a is a
constant associated with the number of pipelines and the arithmetic mode.
Table 3.1 contains the a values for some relevant circumstances.

It is now obvious that high performance is achieved by minimizing the number 
of "start-ups" as a consequence of using longer vectors, or, increasing 
N for each vector operation. 
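A small numerical sketch of Eqn. 3.1 shows how strongly the vector length matters. The values a = 1/2 (64-bit) and a = 1/4 (32-bit) for a two-pipe machine are assumptions inferred from the result rates quoted earlier (one 64-bit result per cycle per pipe, twice that in 32-bit mode); they are not taken from Table 3.1.

```python
STARTUP_CYCLES = 50   # approximate start-up time quoted in the text

def vector_time_cycles(n, a):
    """Eqn. 3.1: start-up plus stream time for a vector of n elements."""
    return STARTUP_CYCLES + a * n

def efficiency(n, a):
    """Fraction of the instruction time spent streaming results."""
    return a * n / vector_time_cycles(n, a)

A_64BIT_TWO_PIPE = 0.5    # assumed: two 64-bit results per cycle
A_32BIT_TWO_PIPE = 0.25   # assumed: four 32-bit results per cycle

eff_short = efficiency(3, A_64BIT_TWO_PIPE)       # one 3x3 matrix row
eff_long_64 = efficiency(2048, A_64BIT_TWO_PIPE)  # half a time slice
eff_long_32 = efficiency(2048, A_32BIT_TWO_PIPE)
```

With vectors of length 3 almost all of the time goes to start-up, while at N = 2048 the same formula gives efficiencies above 90% in both arithmetic modes.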

The SU(3) matrices are too small as objects for vectorization;
however, there are n_s^3 such matrices in every time slice. One cannot
use all of these link values simultaneously because

i) updating each link requires all its immediate neighbors, and

ii) the correct convergence of the Metropolis process depends upon
using "new" values as soon as they are available.

The Red-Black (checkerboard) ordering resolves this apparent recursive
relationship. The separation of the variables into two sets, for
each value of μ and at fixed x_4, is achieved by putting in the two
sets all the variables belonging to links originating from odd and even
sites, respectively, i.e. with x_1 + x_2 + x_3 = 1 or 0 (mod 2). This
assures the independence of the forces from the variables being
upgraded. On a lattice with n_s = 16 the above separation gives a vector
length of n_s^3/2 = 2048, sufficiently large to ensure almost optimal
performance (in fact, 91% and 95% in 32-bit and 64-bit arithmetic,
respectively). The calculation of the force requires knowledge
of the variables associated with links neighboring the one under
consideration. Because of boundary conditions, which we take to be periodic,
the variables which enter the calculation of F_μ will not, in general,
have a simple location-index relative to U_μ in the array of dimension
n_s^3/2. This is easily remedied by the introduction of auxiliary
integer-valued arrays, where the indices of the various neighbors of
U_μ are prestored. The Gather instruction plays a crucial role in the
way these index arrays are used. When F_μ is evaluated, all the
needed variables are gathered into temporary arrays, so that the indices
of all elements entering into the computation of F_μ are the same, and
this proceeds in a fully vectorized manner.
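The combination of the checkerboard split with prestored neighbor indices can be sketched as follows. This is an illustration only: the site numbering, the function names and the scalar test field are assumptions, and only the three spatial directions within one time slice are treated.

```python
import numpy as np

def site_index(x1, x2, x3, n):
    """Linear index of site (x1, x2, x3) in an n^3 slice, with periodic wrap."""
    return (x1 % n) + n * ((x2 % n) + n * (x3 % n))

def build_tables(n):
    """Return parity, odd sites, even sites and a (6, n^3) neighbor index table."""
    x3, x2, x1 = np.meshgrid(np.arange(n), np.arange(n), np.arange(n),
                             indexing="ij")
    x1, x2, x3 = x1.ravel(), x2.ravel(), x3.ravel()
    parity = (x1 + x2 + x3) % 2
    sites = site_index(x1, x2, x3, n)
    neighbors = np.stack([
        site_index(x1 + 1, x2, x3, n), site_index(x1 - 1, x2, x3, n),
        site_index(x1, x2 + 1, x3, n), site_index(x1, x2 - 1, x3, n),
        site_index(x1, x2, x3 + 1, n), site_index(x1, x2, x3 - 1, n),
    ])
    return parity, sites[parity == 1], sites[parity == 0], neighbors

parity, odd_sites, even_sites, neighbors = build_tables(16)

# Gather the +x1 neighbors of every odd site into one dense vector.
field = np.random.default_rng(0).random(16**3)
forward_x1 = field[neighbors[0, odd_sites]]
```

For even n every neighbor of an odd site is an even site and vice versa, so all members of one set can be updated together from gathered copies of the other set.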

Once the F_μ's are determined the algorithm for the upgrading
of all the U_μ (in the same set) is straightforward and completely
vectorizable. The matrices R which are used for finding the new
candidates U'_μ are Gathered according to an array of indices extracted
at random from a table. The table contains M SU(3) matrices which have
a distribution centered around the identity of the group and are obtained
in the following fashion. For each value of i between 1 and M/2
(M must be even) an eight-component vector a_k with approximately
gaussian distribution and <a^2> = 1 is pseudorandomly generated. The
fourth-order approximation to R_i is given by

$$ R_i^0 = 1 + iA + \frac{(iA)^2}{2!} + \frac{(iA)^3}{3!} + \frac{(iA)^4}{4!} \approx \exp(iA) , \qquad (3.2) $$
where

$$ A = b \sum_{k=1}^{8} a_k \lambda_k , \qquad (3.3) $$

λ_k are Gell-Mann's matrices (i.e., a set of generators of the Lie
algebra of SU(3)) and b is a real parameter specifying the spread
of the distribution. The final value for R_i is obtained by normalizing
R_i^0 to a unitary-unimodular matrix. In general, if we denote the
three columns of an SU(3) matrix by r_1, r_2 and r_3, the constraint
of being unitary and unimodular is expressed by

$$ |r_1|^2 = |r_2|^2 = 1 , \qquad r_1^* \cdot r_2 = 0 \qquad (3.4) $$

and r_3 = (r_1 x r_2)*.

Given a matrix R^0 with the first two columns r_1^0 and r_2^0, with
r_1^0 x r_2^0 ≠ 0, we shall define as the normalized form of R^0 the matrix
R with columns

$$ r_1 = \frac{r_1^0}{|r_1^0|} , \qquad r_2 = \frac{r_2^0 - r_1 (r_1^* \cdot r_2^0)}{|r_2^0 - r_1 (r_1^* \cdot r_2^0)|} \qquad (3.5) $$

and

r_3 = (r_1 x r_2)*.
The reason for the nomenclature is that, if R^0 differs
slightly from a unitary-unimodular matrix, e.g. as a consequence of
roundoff errors, then R is an SU(3) matrix close in value to R^0.
Thus, the approximately unitary-unimodular matrix R_i^0 obtained by
truncated exponentiation in Eqn. 3.2 is converted to a proper SU(3)
matrix R_i by normalization. The last M/2 matrices are obtained by

$$ R_{M/2+i} = R_i^{-1} \qquad (1 \le i \le M/2) \qquad (3.6) $$

so as to ensure that, together with any given matrix R_i, the inverse
also belongs to the table.
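The construction of the table can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the function names, the spread b = 0.2, the table size and the use of a unit-variance normal generator are all assumptions, and the normalization is written as the usual Gram-Schmidt step followed by a conjugated cross product for the third column.

```python
import numpy as np

def gell_mann():
    """The eight Gell-Mann matrices as an (8, 3, 3) complex array."""
    lam = np.zeros((8, 3, 3), dtype=complex)
    lam[0][0, 1] = lam[0][1, 0] = 1
    lam[1][0, 1], lam[1][1, 0] = -1j, 1j
    lam[2][0, 0], lam[2][1, 1] = 1, -1
    lam[3][0, 2] = lam[3][2, 0] = 1
    lam[4][0, 2], lam[4][2, 0] = -1j, 1j
    lam[5][1, 2] = lam[5][2, 1] = 1
    lam[6][1, 2], lam[6][2, 1] = -1j, 1j
    lam[7] = np.diag([1, 1, -2]) / np.sqrt(3)
    return lam

def truncated_exp(A, order=4):
    """Taylor series of exp(iA) truncated after the term of the given order."""
    result = np.eye(3, dtype=complex)
    term = np.eye(3, dtype=complex)
    for n in range(1, order + 1):
        term = term @ (1j * A) / n
        result = result + term
    return result

def normalize(R0):
    """Project a nearly unitary 3x3 matrix onto SU(3), column by column."""
    r1 = R0[:, 0] / np.linalg.norm(R0[:, 0])
    v = R0[:, 1] - r1 * np.vdot(r1, R0[:, 1])   # remove the r1 component
    r2 = v / np.linalg.norm(v)
    r3 = np.conj(np.cross(r1, r2))
    return np.column_stack([r1, r2, r3])

def build_table(M, b, seed=0):
    """M matrices near the identity; the second half holds the inverses."""
    rng = np.random.default_rng(seed)
    lam = gell_mann()
    first = [normalize(truncated_exp(b * np.tensordot(rng.standard_normal(8),
                                                      lam, axes=1)))
             for _ in range(M // 2)]
    return np.array(first + [R.conj().T for R in first])

table = build_table(M=10, b=0.2, seed=1)
```

Because each normalized matrix is unitary, its inverse is simply its Hermitian conjugate, which is how the second half of the table is filled here.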

The procedure for normalizing the SU(3) matrices of the random 
table, as described above, is also applied, every few iterations, to the 
link matrices. This is done to ensure that the group symmetry of the
matrices is preserved regardless of rounding errors which may be
introduced by the hardware after many arithmetic operations. This
renormalization process is particularly important when the computations
are performed using low precision arithmetic. It gives us confidence, which
was also tested and verified, in using 32-bit arithmetic for our
calculations on the CDC CYBER 205.

Once U'_μ is determined, using the table of random SU(3)
matrices, the action difference is obtained by calculating, separately,

ReTr(U_μ F_μ) and ReTr(U'_μ F_μ)

(notice that ReTr(A†B) is the dot product of the arrays containing
the real and imaginary parts of A and B), forming an array with
exp(-ΔS), comparing with an array of pseudorandom numbers and
accepting or rejecting the change, via a masking operation, according
to the outcome of the vectorized comparison between the random numbers
and the exponentiated action differences. These steps are repeated
for a prefixed number of hits before commencing the upgrade of the other
set or the variables corresponding to different directions.

The conditional acceptance of elements in a vector, or, the masking 
operation referred to above, is handled through the usage of a "bit-vector" 
(the CDC CYBER 205 is bit addressable and the software allows the Fortran 
user to use this feature). It is exploited as a part of the vector 
instruction, and inhibits storing results where zeros are encountered
in the bit-vector.
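The vectorized accept/reject step can be sketched with a boolean mask standing in for the bit-vector. The function names and the array layout (a stack of 3x3 complex matrices, one per link) are assumptions made for illustration.

```python
import numpy as np

def re_tr_adjoint(A, B):
    """Re Tr(A^dagger B) for stacks of matrices, as a dot product of parts."""
    return np.sum(A.real * B.real + A.imag * B.imag, axis=(-2, -1))

def metropolis_accept(old, new, delta_s, uniforms):
    """Keep `new` where uniform < exp(-delta_s), otherwise keep `old`."""
    accept = uniforms < np.exp(-delta_s)          # plays the role of the bit-vector
    updated = np.where(accept[:, None, None], new, old)
    return updated, accept

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 3)) + 1j * rng.standard_normal((4, 3, 3))
B = rng.standard_normal((4, 3, 3)) + 1j * rng.standard_normal((4, 3, 3))

delta_s = np.array([-1.0, 50.0, -1.0, 50.0])
uniforms = np.full(4, 0.5)
updated, accept = metropolis_accept(A, B, delta_s, uniforms)
```

No branch is taken per link: the comparison produces the mask for the whole vector at once, and the store is simply suppressed where the mask is false.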

The reader should by now realize that many thousands of random
numbers are required for each iteration. The conventional congruential
method for generating random numbers is recursive, and may be described
by

$$ y_{i+1} = (a \cdot y_i) \bmod b \qquad (3.7) $$

where a is the "multiplier" and b is chosen so that y_{i+1} will
be approximately the lower half of the coefficient of the product a·y_i.
The nature of this calculation suggests that in order to produce N
random numbers one has to repeat it serially N times. There is,
however, a way to reproduce the same sequence of N numbers in
parallel, using vector instructions [7]. Define a new multiplier by
$$ A = a^N \bmod b = (\ldots((a \cdot a) \bmod b \cdot a) \bmod b \ldots \cdot a) \bmod b \qquad (3.8) $$

and let

$$ Y_1 = (y_1, y_2, \ldots, y_N) \qquad (3.9) $$

be the vector containing the first N random numbers. Then

$$ Y_{i+1} = (A \cdot Y_i) \bmod b \qquad (3.10) $$

reproduces the same sequence of random numbers one gets with a
repeated application of Eqn. 3.7 (the computation of Eqn. 3.10 requires
only 3 vector operations on the CDC CYBER 205).
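Eqns. 3.7 to 3.10 can be checked with a short sketch. The multiplier 16807 and modulus 2^31 - 1 are hypothetical choices made so that every product fits in a 64-bit integer; the text does not state the constants actually used.

```python
import numpy as np

A_MULT = 16807         # hypothetical multiplier a
B_MOD = 2**31 - 1      # hypothetical modulus b

def serial_lcg(seed, count):
    """Eqn. 3.7 applied count times, one number per step."""
    out, y = [], seed
    for _ in range(count):
        y = (A_MULT * y) % B_MOD
        out.append(y)
    return out

def vector_lcg(seed, n, blocks):
    """Generate blocks*n numbers, n at a time, using the multiplier a^N mod b."""
    first = np.array(serial_lcg(seed, n), dtype=np.uint64)    # Y_1, Eqn. 3.9
    big_a = np.uint64(pow(A_MULT, n, B_MOD))                  # Eqn. 3.8
    result = [first]
    for _ in range(blocks - 1):
        result.append((big_a * result[-1]) % np.uint64(B_MOD))  # Eqn. 3.10
    return np.concatenate(result)

vectorized = vector_lcg(seed=12345, n=8, blocks=4)
reference = serial_lcg(seed=12345, count=32)
```

Only the first block of N numbers is produced serially; every later block costs one vector multiply and one vector modulus, and the sequence is identical to the serial one.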

To conclude this section, let us discuss the way matrix multiplication
is done, this being the most time-consuming aspect of the computation. First,
the reader will remember that we do not vectorize the matrix multiplication 
as such, but, rather, perform the operations on many matrices in parallel, 
where for each matrix the "scalar" sequence of operations is followed. 

When computing the products of two SU(3) matrices, one need not
evaluate all the columns of the result, since the third column of the
product matrix (which is again unitary-unimodular) is related to the
first two by Eqn. 3.4. In the code we have exploited this fact whenever
possible. It is particularly advantageous when several SU(3) matrices
must be multiplied together, since one may limit the calculations to
two columns out of three in all intermediate products and simply
reconstruct the third column of the final result as shown in Eqn. 3.4.
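The two-column shortcut can be illustrated as follows. The helper `random_su3`, which builds test matrices from a QR factorization, is a hypothetical convenience and not part of the method described in the text; the third column is rebuilt as the complex conjugate of the cross product of the first two.

```python
import numpy as np

def random_su3(rng):
    """A random SU(3) test matrix: unitary Q from QR, rescaled to det = 1."""
    z = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
    q, _ = np.linalg.qr(z)
    q[:, 2] = q[:, 2] / np.linalg.det(q)   # |det q| = 1, so the column stays normalized
    return q

def su3_product_two_columns(U, V):
    """Product U V computed from two columns, third column reconstructed."""
    first_two = U @ V[:, :2]                                   # 3x2 result
    third = np.conj(np.cross(first_two[:, 0], first_two[:, 1]))
    return np.column_stack([first_two, third])

rng = np.random.default_rng(3)
U, V = random_su3(rng), random_su3(rng)
product = su3_product_two_columns(U, V)
```

In a chain of several multiplications only the last factor needs to be carried as two columns through each step, so a third of the multiply-add work is saved at every intermediate product.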
Finally, all complex arithmetic has been done in terms of real
variables, separating real and imaginary parts (which would also
result in a more efficient code for a scalar machine), and we have
used the identity

$$ (A+iB)(C+iD) = (A+B)(C-D) - BC + AD + i(BC+AD) \qquad (3.11) $$

to perform the product of two complex matrices in terms of three real
matrix multiplications and five real matrix additions. Using complex
arithmetic the product of two matrices would require four real
matrix multiplications and two real matrix additions. Due to the fact
that matrix multiplication requires 2N^3 operations, where N is the
dimension of the matrix, and matrix addition requires only N^2
operations, our method pays off even for N = 3.
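The identity of Eqn. 3.11 and the resulting operation count can be verified numerically. The operation model used below (2N^3 real operations per real matrix multiplication and N^2 per matrix addition) is an assumption about the intended counts, and the function names are illustrative.

```python
import numpy as np

def complex_matmul_3m(A, B, C, D):
    """Real and imaginary parts of (A + iB)(C + iD) using three real products."""
    t = (A + B) @ (C - D)
    bc = B @ C
    ad = A @ D
    return t - bc + ad, bc + ad

def real_ops(n, multiplications, additions):
    """Assumed cost model: 2 n^3 per real matrix product, n^2 per matrix addition."""
    return multiplications * 2 * n**3 + additions * n**2

rng = np.random.default_rng(7)
A, B, C, D = (rng.standard_normal((3, 3)) for _ in range(4))
re_part, im_part = complex_matmul_3m(A, B, C, D)

ops_three_mult = real_ops(3, multiplications=3, additions=5)
ops_four_mult = real_ops(3, multiplications=4, additions=2)
```

The identity holds for non-commuting matrices because the factor order is preserved in every product.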

A schematic outline of the flow of the calculations is shown 
in Fig. 3. 

4. Performance and Timings 

The figures quoted here are based on runs executed on a two-pipe
CDC CYBER 205 with 2M 64-bit words of memory. They apply to a 16^4
lattice (n_s = 16, n_t = 16),
SU(3) gauge theory with 10 hits per link upgrade (unless stated explicitly
otherwise). We present performance figures for both 64-bit and 32-bit
arithmetic operations. In both modes the exponentiation and the generation
of random numbers were carried out using 64-bit arithmetic. It should be
noted here that due to our slicing mechanism the 32-bit version requires
real memory of only 852,000 words (64-bit words, or 6.8 Mbytes), so it
actually fits comfortably on a 1M-word system. With these parameters
the code performs at 98% CPU utilization. The 64-bit version requires,
of course, twice as much memory.

In Table 4.1 we give the percentage of the execution time for the
two arithmetic modes spent in the force (F_μ) and the Metropolis
updating calculations. It becomes clear from these figures why it is
worthwhile to use a single force computation for a number of attempts at
updating (rather than the one attempt proposed by the original Metropolis
method).

It should be added here that the normalization procedure discussed in
Sec. 3, performed every 5 iterations, adds only 0.74% and 0.59% in
64-bit and 32-bit modes, respectively, to the total execution time.

Table 4.2 presents a percentage breakdown of the code by operation 
type. The reader will notice that the Gather, random number generation 
and the exponentiation operations are more heavily weighted in the 32-bit 
mode compared with that of the 64-bit mode. These three types of 
operations perform at the same rate in both modes. The last two 
execute in 64-bit mode in both versions of the code. The Gather instruction 
performs at the same rate regardless of whether the operands are 64-bit 
or 32-bit variables. This is because the performance of the Gather 
operation is driven by memory access (and not by computation complexity). 

The matrix multiplication, being made up of floating-point operations
only, executes at near peak rate of 95 MFLOPS and 182 MFLOPS for the
64-bit and 32-bit modes, respectively. The effect of vectorizing the
random number generator can be illustrated by noting that this operation
amounted to 6% (64-bit) and 11% (32-bit) of the total time when it was
not vectorized. The "action" involves taking the real part of the
trace of products of SU(3) matrices (purely floating-point operations).
The "acceptance" is the portion of the code where the conditional acceptance
of new U_μ matrices occurs under the control of a bit-vector
created for that purpose.

The actual time for one iteration of the 16^4 lattice
with 10 hits is 16.27 secs. (64-bit) and 10.72 secs. (32-bit). This
amounts to a sustained performance rate of 66.8 MFLOPS (64-bit) and
101.5 MFLOPS (32-bit). Another way, commonly used by physicists, to
express the performance of Monte Carlo lattice gauge theories implemented
on a computer system, is the link update time, i.e., the time needed
to update one link of the lattice once. This measure is useful for
comparisons since it is independent of the lattice size. The link
update times (in μsec) for our implementation are given in Table 4.3.
These figures may be compared to a link update time of about 1,100
μsec on the CDC 7600 computer system with a highly optimized code.
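The link update time follows directly from the iteration time and the number of links, as this short check shows. The function name is an assumption; the lattice size and timings are the ones quoted above.

```python
def link_update_time_us(seconds_per_iteration, n_s, n_t, directions=4):
    """Time to update one link once, in microseconds."""
    links = directions * n_s**3 * n_t
    return seconds_per_iteration / links * 1e6

t64 = link_update_time_us(16.27, n_s=16, n_t=16)
t32 = link_update_time_us(10.72, n_s=16, n_t=16)

# Ratio to the CDC 7600 figure of about 1,100 microseconds per link.
speedup_64 = 1100.0 / t64
speedup_32 = 1100.0 / t32
```

The two results reproduce the 10-hit entries of Table 4.3 to one decimal place.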
Table 3.1. Stream rate proportionality factor (a).



Table 4.1. Breakdown by percentage of sections
of code.

            64-bit    32-bit
force       43.49     42.46
update      55.40     ?7.40  (first digit illegible)

Table 4.2. Breakdown by percentage of the main operation
types.

operation type             64-bit    32-bit
matrix multiplication      58.33     47.05
Gather                     20.78     29.27
random number generator     0.95      1.83
exponentiation              7.43     11.72
action                      5.93      4.70
acceptance                  3.62      3.01

Table 4.3. The upgrade times for a link (in μsec).

number of hits    64-bit    32-bit
10                62.1      40.9
8                 55.1      36.3
Figure 1. "Forward" (upper half) and "backward" (lower half) plaquettes
in the μ-ν plane, where x = (x_1, x_2, x_3, x_4) is a point in our
four-dimensional lattice. This is one out of three such planes which
can be formed in a four-dimensional space.
Figure 3. Schematic description of the computational process (flow chart).
* Adapting and designing mathematical software to achieve
optimum performance on the CYBER 205 will be discussed 


* Comments and observations are made in light of recent work 
done at the Center for Numerical Analysis on 

- modifying the ITPACK software package 

- writing new software for vector supercomputers 
Research goal - develop very efficient vector algorithms and
software for solving large sparse linear systems
using iterative methods


(older) SCALAR APPROACH - develop algorithms that minimize 
either number of iterations or arithmetic operations 

* Not necessarily the correct approach for vector computers * 


(newer) VECTOR APPROACH - avoid operations such as table 
lookups, indirect addressing, etc. that are inefficient on a 
vector computer, i.e., non-vectorizable 


* Fully vectorizable code may involve more arithmetic operations 
but can be executed at a very high rate of speed * 


* Advances in high performance computers and in computer
architecture necessitate additional research in mathematical
software to find suitable algorithms for the supercomputers of
today and of the future *


THE VECTORIZATION OF THE ITPACK SOFTWARE PACKAGE


Scalar ITPACK: 

package for solving large sparse linear systems 
7 iterative algorithms available 
sparse storage format used 
Kincaid, Respess, Young, & Grimes [1982] 

ITPACK 2C (ALGORITHM 586) in T.O.M.S. 
"Transactions on Mathematical Software" 


VECTORIZATION: 

- First step: look for obvious vectorization changes since this 
was a large package of over 11,000 lines of code and we did not 
want to completely rewrite it 

- Vector ITPACK (standard Fortran version) : used a minimum of 
vector syntax available in CYBER 200 Fortran for a portable 
version of Vector ITPACK 2C 

- Vector ITPACK (CYBER 205 version) : a modified version of 
Vector ITPACK written using CYBER 200 Fortran vector syntax 
where possible 
ADAPTING SCALAR ITPACK 2C FOR HIGH PERFORMANCE COMPUTERS


- DO loops which had been unrolled for scalar optimization were 
not recognized as vectorizable by optimizing vector compilers 

- These loops were rewritten as simple tight DO loops so that 
they would be executed in vector mode 

- The sparse storage scheme used for the matrix in Scalar ITPACK
was row-oriented and inhibited vectorization (the IA-JA-A data
structure, as in the Yale software YSMP, was used)

- A column-oriented data structure was used in Vector ITPACK to
increase vectorization (the COEF-JCOEF data structure, as in the
Purdue software ELLPACK, was used)

- The version of Vector ITPACK specifically for the CYBER 205 
was tested on the CYBER 205 at Colorado State University (CSU) 
and has been added to their Program Library 

- The improvements in time of the vector syntax version over the 
one in standard Fortran were not as significant as we had 
anticipated 

- The automatic vectorization available in the CYBER 205 Fortran 
compiler did a very good job of optimization and vectorization 

Moral: vector syntax best when used in designing and writing 
new code 
PROBLEM:

    u_xx + 2 u_yy = 0      on S = (0,1) x (0,1)

    u = 1 + xy             on boundary of S


Discretization: standard 5-point finite difference formula

Stopping Criterion: 5.0 x 10^-6

Mesh Sizes: 1/16; 1/32; 1/64; 1/128; 1/256 

Number of Unknowns: 225; 961; 3969; 16,129; 65,025 

Computer : CSU CYBER 205 

CYBER 200 Fortran: Large pages, unsafe vectorization 

Scalar ITPACK (unrolled DO-loops & YALE storage used;
T.O.M.S. version)

Modified Scalar ITPACK (rolled DO-loops & minor changes:
Q8SDOT used)

Vector ITPACK (rolled DO-loops, ELLPACK storage, &
CYBER 200 Fortran vector syntax used)
TABLE I: CHANGING SPARSE STORAGE
(Iteration Times in Seconds with H = 1/64)

Method        Iterations    Scalar     Modified         Vector
                            ITPACK     Scalar ITPACK    ITPACK

(Natural Ordering)
JACOBI CG        178         2.509        2.184           .262
JACOBI SI        362         5.214        4.480           .580
SOR              216         4.700        4.597          2.453
SSOR CG           34         1.976        1.788           .831
SSOR SI           43         1.791        1.682           .970

(Red-Black Ordering)
JACOBI CG        178         2.402        2.056           .268
JACOBI SI        362         4.987        4.209           .590
SOR              196         4.110        4.017           .523
SSOR CG          341        20.327       18.472          2.177
SSOR SI          196         7.734        6.690           .701
RS CG             90         1.445        1.358           .118
RS SI            182         2.980        2.779           .223
TABLE II: CHANGING PROBLEM SIZE
(Number of Iterations)

Method       H = 1/16     1/32     1/64    1/128    1/256

(Natural Ordering)
JACOBI CG        49         94      178      330      629
JACOBI SI        56        179      362      772     1372
SOR              50        104      216      422      872
SSOR CG          16         22       34       51       73
SSOR SI          19         29       43       61       88

(Red-Black Ordering)
JACOBI CG        49         94      178      330      629
JACOBI SI        56        179      362      772     1372
SOR              52       10-1      196      396      839
SSOR CG          34         62      341     1058     3061
SSOR SI          51        107      196      373      752
RS CG            25         48       90      167      321
RS SI            42         88      182      375      704
TABLE III: CHANGING PROBLEM SIZE
(Iteration Time in Seconds)

Method       H = 1/16     1/32     1/64    1/128     1/256

(Natural Ordering)
JACOBI CG       .010      .040     .251    1.800    14.115
JACOBI SI       .014      .091     .560    4.196    28.741
SOR             .035      .292    2.446   19.828   164.940
SSOR CG         .027      .133     .828    4.953    28.157
SSOR SI         .029      .163     .967    5.583    32.249

(Red-Black Ordering)
JACOBI CG       .010      .041     .257    1.847    14.511
JACOBI SI       .013      .091     .571    4.277    29.394
SOR             .011      .066     .475    4.028    34.939
SSOR CG         .018      .075    2.105   25.779   302.712
SSOR SI         .021      .113     .663    4.452    36.053
RS CG           .006      .019     .109     .757     5.981
RS SI           .008      .033     .207    1.557    11.881
COMMENTS ON TABLE I


- Two versions of Scalar ITPACK were compared with the CYBER 205
version of Vector ITPACK

- Mesh size H = 1/64 used for all runs

- Scalar ITPACK: unrolled DO-loops used in basic vector
operations for increased optimization on scalar computers

- Modified Scalar ITPACK: standard tight DO-loops used

- Vector Fortran compiler recognizes tight loops as vectorizable
but not unrolled loops

- A slight increase in speed from Scalar to Modified Scalar
version

- Vector ITPACK uses tight loops, Fortran vector syntax, and a
column-oriented sparse storage scheme

- This data structure allows the matrix-vector product operation
to vectorize to a great extent


* Considerable improvement in performance from scalar to vector
version of ITPACK *
COMMENTS ON TABLES II & III

- These tables are results of using Vector ITPACK on the same
problem with varying mesh sizes

- The number of iterations increases as the problem size increases

- Comparisons based on number of iterations are misleading as to the
best method!

- On scalar computers, SOR with natural ordering is widely used
while JACOBI is not, but on vector computers . . .

- Most efficient method on the CYBER 205:

JACOBI CG method when natural ordering is used
RS CG when red-black ordering is used
SCALAR ITPACK vs. VECTOR ITPACK


- Total time for each method is not significantly greater than
the iteration time in the vector version (this was not the case
in the scalar version)

- Only N additional workspace locations required for the vector
version over the scalar version

- Faster scaling and permuting of the system with the
column-oriented sparse storage scheme

- Improved performance of the SSOR methods with the red-black
ordering in the vector version in spite of the greater number of
iterations
A PRE-CONDITIONED CONJUGATE GRADIENT PACKAGE


Thomas C. Oppe, a graduate student at UT Austin, is working on 
a package which allows flexibility in the choice of basic 
methods and acceleration schemes. 

The package has been designed to make the addition of further 
preconditionings and acceleration schemes easy. 

Particular attention has been paid to the choice of matrix
storage schemes with a view to maximizing vectorizability.


Features of Package:

- Conjugate Gradient Acceleration

- Pre-conditioning matrix Q (Jacobi, Symmetric Successive
Overrelaxation, Reduced System, Incomplete Cholesky, Modified
Incomplete Cholesky, Neumann Polynomial, Parameterized
Polynomials; other pre-conditionings planned, such as Incomplete
Block Cyclic Reduction)

- Realistic Stopping Tests 

- Automatic estimation of iteration parameters with adaptive 
procedures 

- Two possible data structures allowed 





DATA STRUCTURES


Data structures which allow vectorization to varying degrees:

EXAMPLE:

         4  -1  -2   0
    A = -1   4   0  -2
        -2   0   4  -1
         0  -2  -1   4


ELLPACK Data Structure:

            4  -1  -2                1  2  3
    COEF =  4  -2  -1       JCOEF =  2  4  1
            4  -1  -2                3  4  1
            4  -2  -1                4  2  3

- matrix-vector product vectorizes with the use of gathering
routines

- operations such as forward (back) substitutions using lower
(upper) triangular matrices do not vectorize


DIAGONAL Data Structure:

            4  -1  -2
    COEF =  4   0  -2       JCOEF = (0, 1, 2)
            4  -1   *
            4   *   *

- the matrix-vector product operation vectorizes without the use
of gathering routines

- operations such as forward (back) substitution and
factorizations vectorize to some extent
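The difference between the two storage schemes can be seen in a short sketch of the matrix-vector product for the 4x4 example. This is an illustration only, not the package's code: the function names are assumptions, column indices are shifted to 0-based, and the "*" padding entries of the diagonal format are stored as zeros.

```python
import numpy as np

A = np.array([[ 4, -1, -2,  0],
              [-1,  4,  0, -2],
              [-2,  0,  4, -1],
              [ 0, -2, -1,  4]], dtype=float)

# ELLPACK format: one row of coefficients and column numbers per matrix row.
COEF = np.array([[4, -1, -2], [4, -2, -1], [4, -1, -2], [4, -2, -1]], dtype=float)
JCOEF = np.array([[1, 2, 3], [2, 4, 1], [3, 4, 1], [4, 2, 3]]) - 1

def ellpack_matvec(coef, jcoef, x):
    """y = A x; x[jcoef] is the gather of the needed entries of x."""
    return np.sum(coef * x[jcoef], axis=1)

# Diagonal format for a symmetric matrix: main diagonal plus two superdiagonals.
DIAG_COEF = np.array([[4, -1, -2], [4, 0, -2], [4, -1, 0], [4, 0, 0]], dtype=float)
DIAG_OFFSETS = (0, 1, 2)

def diagonal_matvec(coef, offsets, x):
    """y = A x using contiguous slices only; no index vector is needed."""
    n = len(x)
    y = np.zeros(n)
    for col, k in enumerate(offsets):
        d = coef[:n - k, col]
        y[:n - k] += d * x[k:]        # superdiagonal contribution
        if k > 0:
            y[k:] += d * x[:n - k]    # mirrored subdiagonal (symmetry)
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
y_ell = ellpack_matvec(COEF, JCOEF, x)
y_diag = diagonal_matvec(DIAG_COEF, DIAG_OFFSETS, x)
```

The ELLPACK product needs an indexed fetch of x for every coefficient column, whereas the diagonal product touches only contiguous vector sections.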
Abstract

Two of the most challenging problems of organometallic
chemistry (loosely defined) are pollution
control with the large space velocities needed and
nitrogen fixation, a process so capably done by nature and
so relatively poorly done by man (industry). For a
computational chemist these problems are on the fringe
of what is possible with conventional computers (large
models needed and accurate energetics required). A
summary of the algorithmic modification needed to
address these problems on a vector processor such as the
Cyber 205 and a sketch of our findings to date on deNOx
catalysis and nitrogen fixation are presented.
Introduction


Two of the most challenging problems in organometallic
chemistry (loosely defined) are pollution control with the large
space velocities needed and nitrogen fixation, a process so 
capably done by nature and so relatively poorly done by man 
(industry). For a computational chemist these problems (and 
other similar problems) are on the fringe of what is possible 
with conventional computers (large models needed and accurate 
energetics required). The advent of vector processors such as 
the Cyber 205 is making such studies feasible. A summary of the 
algorithmic modification needed to address these problems on a 
vector processor is presented in section I, a sketch of the 
findings to date for deNOx catalysis is presented in section 
II, and finally a sketch of the nitrogen fixation results is 
presented in section III. 

I. Algorithmic Modification. 

The advent of vector processors is leading to a reexamination 
of fundamental computational algorithms of general use to comp- 
utational chemists and the redesign of large scale codes. The 
present work illustrates both processes for the Cyber 205 comp- 
uter. Reexamination of fundamental algorithms is illustrated 
with an examination of the similarity transform, a matrix oper- 
ation of use to computational chemists. Large scale code rede- 
sign is examined through the implementation of a highly vec- 
torized MC-SCF code. 

A. Similarity Transform. A common sequence of matrix operations 
is the similarity transform 

C = A^T B A (1).
For computational chemistry applications the matrices B and C are
usually symmetric and generally stored in lower diagonal form. If
the initial B matrix is expanded from lower diagonal form to full
matrix representation, vector operations are possible for both
matrix multiplications. The linked triad instruction on the Cyber
205 is utilized for the first matrix multiplication and a vector
dot product operation is used for the second matrix multiplication.
In principle one could transpose matrix A and use the
linked triad instruction for both matrix multiplications;
however, in this case, since we only want slightly more than half
of the final results, the vector dot product is preferable as it
permits selective manipulation of the column indices I and J. As
is apparent from Table I the vectorized matrix transformation
represents a substantial improvement over scalar mode with
enhancements ranging from a factor of 10 to a factor of 40. Note
that for the 300x300 matrix case we are still approximately a factor
of 2 off the maximum rate for the Cyber 205. The consideration
of an algorithm where several matrices are transformed at once is
in order. In addition it should be noted from Table I that the
expansion from lower diagonal form does not add a significant
cost (less than 10 percent). Finally, it should be apparent that
the MFLOPS rate will be independent of the number of orbitals
involved (indices I and J); the vectorized loops run over the number
of functions, not orbitals (indices K and L).
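The structure of the transform can be sketched in NumPy. This is a hedged illustration rather than the Cyber 205 code: the function names and the row-by-row packed ordering of the lower triangle are assumptions, the first multiplication is written as an ordinary matrix product, and the second is written as explicit dot products over the function index so that only the lower triangle of C is formed.

```python
import numpy as np

def expand_packed_lower(packed, n):
    """Expand a symmetric matrix stored as its packed lower triangle."""
    full = np.zeros((n, n))
    full[np.tril_indices(n)] = packed
    return full + np.tril(full, -1).T

def similarity_transform_lower(A, packed_b):
    """Lower triangle of C = A^T B A, with B given in packed lower form."""
    n, m = A.shape                      # n functions, m orbitals
    B = expand_packed_lower(packed_b, n)
    T = B @ A                           # first multiplication
    C = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1):          # only slightly more than half of C
            C[i, j] = np.dot(A[:, i], T[:, j])   # vector length n
    return C

rng = np.random.default_rng(5)
n_functions, n_orbitals = 6, 4
A = rng.standard_normal((n_functions, n_orbitals))
S = rng.standard_normal((n_functions, n_functions))
B_full = S + S.T
packed = B_full[np.tril_indices(n_functions)]

C_lower = similarity_transform_lower(A, packed)
```

Every dot product runs over the function index, so the vector length, and with it the achievable rate, is set by the number of functions rather than by the number of orbitals.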

B. SCF Coding Considerations. The fundamental kernel of self
consistent field (SCF) codes in general (refs. 1, 2) is the energy
expression
The integrals <χ_μ|h|χ_ν> and <χ_μ χ_σ|1/r|χ_ν χ_η> need only be
evaluated once (for a given geometric point), stored conveniently,
and repeatedly accessed during the orbital coefficient (C) and
density matrix element (D) optimization stages.
For the Restricted Hartree-Fock (RHF) wavefunction the nonzero density
matrix elements take the values 2, 2 and -1, and the remaining terms
are zero. For wavefunctions beyond RHF the wavefunction optimization
step represents a vast majority of the time needed to variationally
determine E; that is, the calculation of the χ integrals is usually
relatively insignificant. For this reason initial vectorization
efforts have concentrated on enhancing the time intensive stages
of an MCSCF (multiconfiguration SCF) program. It is generally
accepted that one of the most time intensive steps of a general
MCSCF code is the 4 index transformation needed to convert the χ
integrals to φ_i integrals where

φ_i = Σ_μ C_μi χ_μ .

On scalar processors only the unique integrals are stored (the
canonical list) and the loops are structured so as to minimize
the number of multiplications performed. On a vector processor
such as the Cyber 205 this step simply amounts to two sequential
applications of the matrix transformation described in (1). This
transformation will proceed at vector speed provided that for a
given ij pair all kl integrals are available for k ≥ l (this
corresponds to an effective doubling of the integral file from
its canonical length). This expansion of the canonical integral
tape is accomplished through a straightforward two level bin sort
written to take advantage of the 2 million 64 bit words available
on the Cyber 205. Since the vectorizable portions of this
integral transform are contained in the matrix transform
discussed above, the timing information in Table I applies here.
Four index transformations for 50 basis functions will proceed at
28 MFLOPS and 300 basis function transformations in general will
achieve 82 MFLOPS. Enhancements over scalar computation on the
Cyber 205 will range from a factor of 9 to a factor of 34 for 50
to 300 basis functions. For example, a full integral
transformation will take at most 28 seconds for 50 basis functions
and 10 minutes for 100 basis functions on the Cyber 205.
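To make the "two sequential applications of (1)" concrete, the following Python sketch transforms a small four-index array in two passes. It is a toy under stated assumptions: the names `transform` and `four_index_transform` are hypothetical, the integrals are held as a full nested list rather than as a sorted integral file, and no use is made of the canonical ordering or of vector hardware.

```python
def transform(c, b):
    """Return C^T B C, where c is n_func x n_orb and b is n_func x n_func."""
    n, m = len(c), len(c[0])
    t = [[sum(b[k][l] * c[l][j] for l in range(n)) for j in range(m)]
         for k in range(n)]
    return [[sum(c[k][i] * t[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]


def four_index_transform(c, g):
    """Transform g[mu][nu][sigma][eta] over functions to (ij|kl) over orbitals."""
    n, m = len(c), len(c[0])

    # Pass 1: for each (mu, nu) pair, transform the (sigma, eta) matrix.
    half = [[transform(c, g[mu][nu]) for nu in range(n)] for mu in range(n)]

    # Pass 2: for each (k, l) pair, transform the (mu, nu) matrix.
    out = [[[[0.0] * m for _ in range(m)] for _ in range(m)] for _ in range(m)]
    for k in range(m):
        for l in range(m):
            block = [[half[mu][nu][k][l] for nu in range(n)] for mu in range(n)]
            tb = transform(c, block)
            for i in range(m):
                for j in range(m):
                    out[i][j][k][l] = tb[i][j]
    return out
```

Each pass is just a batch of independent matrix transformations, which is why the Table I timings carry over to the integral transformation.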

For a wide class of useful wavefunctions (open-shell HF and
perfect pairing-generalized valence bond [GVB-PP] are two such
examples) the one- and two-electron density matrices $D^i_j$ and
$D^{ij}_{kl}$ are expressible in diagonal form; that is, the only
nonzero elements are

    $D^i_i = 2 f_i$,  $D^{ii}_{jj} = a_{ij}$,  and  $D^{ij}_{ij} = b_{ij}$    (6).

The energy expression (2) simplifies to

    $E = 2\sum_i f_i h_{ii} + \sum_{i,j} (a_{ij} J_{ij} + b_{ij} K_{ij})$    (7)

where

    $J_{ij} = (ii|jj)$ and $K_{ij} = (ij|ij)$    (8)

are the usual Coulomb and exchange integrals. Restricting our
attention to this class of wavefunction leads to particularly
simple variational equations partitionable into a step where
occupied and virtual orbitals are mixed variationally (OCBSE)
and a step where independent occupied orbitals are mixed through
pairwise rotations. The OCBSE step utilizes terms
representable as a vectorizable summation of $J_i$ and $K_i$
operators

    $\langle\chi_\mu|J_i|\chi_\nu\rangle$ and $\langle\chi_\mu|K_i|\chi_\nu\rangle$    (9)

where

    $\langle\chi_\mu|J_i|\chi_\nu\rangle = \sum_{\sigma,\eta} c^i_\sigma c^i_\eta (\mu\nu|\sigma\eta)$
                                                                        (10).
    $\langle\chi_\mu|K_i|\chi_\nu\rangle = \sum_{\sigma,\eta} c^i_\sigma c^i_\eta (\mu\sigma|\nu\eta)$

That is

    $\langle\chi_\mu|H_i|\chi_\nu\rangle = \sum_j \left[ a_{ij}\langle\chi_\mu|J_j|\chi_\nu\rangle + b_{ij}\langle\chi_\mu|K_j|\chi_\nu\rangle \right]$    (11),

where a set of loops can be written (which are in linked triad
form and will run at >170 MFLOPS for more than 50 basis
functions) to evaluate the Ith hamiltonian (K runs from 1 to
n(n+1)/2).

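      DO 300 J=1,NHAM
C     Scalars are named AIJ and BIJ so they do not clash with
C     the coefficient arrays A and B.
      AIJ = A(I,J)
      BIJ = B(I,J)
C     Linked triad: H = H + scalar * vector
      DO 100 K=1,MXS                                    (12)
  100 H(K)=H(K)+AIJ*AJ(K,J)
      DO 200 K=1,MXS
  200 H(K)=H(K)+BIJ*AK(K,J)
  300 CONTINUE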
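For readers less familiar with Fortran, the same accumulation can be sketched in Python. The function name `build_hamiltonian` and the list-of-lists layout (`aj[j]` and `ak[j]` holding the packed J and K operator of orbital j, each of length MXS = n(n+1)/2) are assumptions made for this illustration only.

```python
def build_hamiltonian(i, a, b, aj, ak, mxs):
    """Accumulate the packed Ith hamiltonian of Eq. (11), as in loop (12)."""
    h = [0.0] * mxs
    for j in range(len(a[i])):
        a_ij, b_ij = a[i][j], b[i][j]
        j_op, k_op = aj[j], ak[j]
        # Each inner loop is a linked triad: h = h + scalar * vector.
        for k in range(mxs):
            h[k] += a_ij * j_op[k]
        for k in range(mxs):
            h[k] += b_ij * k_op[k]
    return h
```

The inner loops run over the packed function-pair index, so the vector length grows as n(n+1)/2 while the number of hamiltonians only sets the trip count of the outer loop.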


As the rotations step utilizes a subset of the above integrals,
the needed vectorization effort is narrowed down to rapidly
generating the terms in (10). If all $\sigma\eta$ terms with
$\sigma \ge \eta$ are stored for a given $\mu\nu$, the double sums in
(10) can be reduced to a single dot product over a combined index
$\gamma$ of length n(n+1)/2

    $\langle\chi_\mu|J_i|\chi_\nu\rangle = \sum_\gamma D^i_\gamma J^{\mu\nu}_\gamma$
                                                                        (13)
    $\langle\chi_\mu|K_i|\chi_\nu\rangle = \sum_\gamma D^i_\gamma K^{\mu\nu}_\gamma$

where

    $J^{\mu\nu}_\gamma = (\mu\nu|\sigma\eta)$
                                                                        (14).
    $K^{\mu\nu}_\gamma = ((\mu\sigma|\nu\eta) + (\mu\eta|\nu\sigma))/2$
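The reduction of the double sum to one dot product can be checked with a short Python sketch. The helper names (`pack_density`, `pack_symmetric`, `dot`) are hypothetical, and the factor of 2 applied to the off-diagonal density products is an assumption about how the $D^i_\gamma$ are defined; it is what makes the packed sum equal to the full double sum when the integral matrix is symmetric in $\sigma$ and $\eta$, as the symmetrized form in (14) guarantees for the exchange term.

```python
def pack_density(c_i):
    """Packed density vector for orbital i over pairs (sigma >= eta)."""
    d = []
    for s in range(len(c_i)):
        for e in range(s + 1):
            weight = 1.0 if s == e else 2.0   # off-diagonal pairs occur twice
            d.append(weight * c_i[s] * c_i[e])
    return d


def pack_symmetric(m):
    """Pack the lower triangle of a symmetric matrix in the same pair order."""
    return [m[s][e] for s in range(len(m)) for e in range(s + 1)]


def dot(x, y):
    """Dot product over the combined index."""
    return sum(p * q for p, q in zip(x, y))
```

For n basis functions the packed vectors have length n(n+1)/2, so `len(pack_density([0.0] * 50))` is 1275.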


Currently the $D^i_\gamma$ are precalculated, stored, and used for an
entire SCF iteration. Formulating the problem as in (13) permits
vectors ranging from 1275 for 50 basis functions to 45150 for 300
basis functions. This step will function at between 80 and 100
MFLOPS, representing enhancements of between 40 and 50 over scalar
computation on the Cyber 205. Table II summarizes the timing for
calculations ranging up to a 79 basis function calculation
consisting of 4096 spatial configurations; that is, a GVB-PP(12/24)
wavefunction. If the calculation were stopped after the RHF
step the SCF would represent less than 1% of the computational
effort. Overall the GVB(12/24) wavefunction optimization
represents 14% of the total effort. This is in sharp contrast to
computations on scalar computers, where this step would account
for greater than 95% of the effort. The timing for an SCF
iterative cycle for three cases is broken down in Table III. Note
that the time needed to generate the terms in (13) is comparable
to that needed to diagonalize the variational hamiltonians
(OCBSE).

II. DeNOx Catalysis.

The catalytic reduction of nitrogen oxides has become increasingly
important in recent years due to legislation aimed at reducing
emission levels from non-biological sources. As nitric oxide is
the major NOx component of exhaust streams (ref. 7), research has
focused on the reduction of nitric oxide. Both homogeneous and
heterogeneous deNOx studies have been performed (refs. 8-11). The
use of base-metal catalysts is of particular interest due to their
ready availability and low cost. A transition metal ion of
singular importance in pollution control is Fe(II), either as the
bulk oxide or ion exchanged into zeolites. These iron systems have
been demonstrated to catalyze the conversion of nitric oxide to
nitrogen with a co-reactant such as CO or H2 (refs. 8, 9). The
mechanism originally proposed by Shelef and Kummer (ref. 12)
consists of a two stage oxidation reduction sequence. The initial
step involves the coupling of two nitric oxides to form nitrous
oxide plus an iron oxide.

    2 NO → N2O + 'O'    (15)

The thus formed nitrous oxide is rapidly reduced by the catalyst.

    N2O → N2 + 'O'    (16)

Completing the cycle, the iron oxide is reduced by reaction with
carbon monoxide, forming carbon dioxide plus the regenerated
catalytic site.

    'O' + CO → CO2    (17)

Efforts have primarily been directed at characterizing reaction
(15) as this is likely to be the kinetically most difficult
step (ref. 8d). For homogeneous systems (15) has been suggested to
involve an intramolecular coupling of nitrosyls to form a
dinitrogen dioxide ligand, which rearranges to a bound cis
hyponitrite.

[Reaction scheme: structures 1, 2, and 3]

Metal hyponitrites have been established to either decompose to
nitrous oxide and the metal oxide or react with carbon
monoxide to form carbon dioxide and nitrous oxide.

It should be stressed that transition metal dinitrogen dioxide
complexes have never been isolated nor unambiguously
detected. Further, only a single mononuclear transition metal
hyponitrite complex has been identified.

In this section we report energetic support for the reaction
sequence (18) for a model Fe(II) system: the dinitrosyl complex
of iron dichloride, FeCl2(NO)2. The relative energetics and
geometries for the chosen complex 1, its coupled cognate
dinitrogen dioxide complex 2, and the cis hyponitrite product 3
are discussed below. We find that the coupled products are
potentially accessible; 2 is only 29 kcal/mol higher in energy
than 1 and 3 only another 19 kcal/mol higher. These species,
though unobserved, should be viable given an appropriate ligand
backbone. Addition of waters of hydration profoundly affects the
relative energies of the hydrated forms of 1, 2, and 3 (4, 5, and
6 respectively). We find that intermediates 5 and 6 are
thermally accessible. Intermediate 5 is 24 kcal/mol more stable
than 4 and 6 is only 4 kcal/mol above 4. This is not surprising
as 1 is a 16 electron system, 2 is a 14 electron system, and 3 is
a 12 electron system (unusual participation by the pi lone pairs
was not observed in the wavefunction of 3 or 6).

A correlation of the bonding orbitals demonstrates that the
coupling reaction 1 to 2 or 4 to 5 will be thermally allowed
(occupied reactant orbitals correlate with occupied product
orbitals (ref. 17)). Further, the LUMO is a non-bonding d orbital of
symmetry indicating that this correlation diagram will be valid
for systems with up to 2 more electrons. Finally, one of the
high lying occupied orbitals is a non-bonding A^ d orbital
suggesting that the correlation diagram will be valid for systems
with up to two fewer electrons. Thus group VI through group VIII
metal dications are potential active catalysts.

Because Fe(II) dinitrosyls are structurally uncharacterized,
because only a single transition metal hyponitrite complex has
been structurally characterized, and because dinitrogen dioxide
complexes are unprecedented, a detailed discussion of the bond
distances and bond angles that were optimized is in order. We
find the N-Fe-N angle for the dinitrosyl is 94.9 degrees, as
expected for a {M(NO)2} system. The Fe-N distance of 1.69 Å
is in agreement with experimental structures for linear iron
dinitrosyls (1.66 Å to 1.71 Å). For the dinitrogen dioxide
complex 2 we find a N-N distance of 1.53 Å, longer than normal
N-N single bonds (ranging from 1.402 Å to 1.492 Å) but still
significantly shorter than that for free dinitrogen dioxide
(2.24 Å). This is consistent with substantial nitrogen-nitrogen
sigma bonding. The Fe-N distance found for the dinitrogen dioxide
complex (2.23 Å) is in accord with the Fe(II) nitrogen bond
distance of 2.26 Å in [Fe(C4H8NH)6][Fe4(CO)13]. Finally, for
the cis hyponitrite complex 3 our Fe-O distance of 1.74 Å
compares favorably with 1.69 Å (the sum of the ionic radii for
OH- (1.18 Å) and an estimate for the ionic radius for four
coordinate Fe(IV) (0.51 Å)). Our N-N distance of 1.21 Å is the same
as the N-N distance determined by X-ray crystallography for
[(Ph3P)2Pt(N2O2)], the only structurally characterized
hyponitrite.

Summarizing, we have demonstrated that (17) is a probable
reaction sequence for group VI through group VIII transition
metal deNOx catalysts. Specifically, our energetics and
correlation diagram suggest that dinitrogen dioxides are
thermodynamically and kinetically accessible cognates of dinitrosyl
complexes. We believe that these results can be extended to
heterogeneous Fe(II) catalyzed deNOx processes as well. In fact
we speculate that the stretching frequencies observed by Hall
(ref. 9c) at 1917 cm^-1 and 1815 cm^-1 are due to bound dinitrogen
dioxide which is blue shifted relative to the free compound (which
has frequencies at 1870 cm^-1 and 1776 cm^-1 (ref. 23)). Because the
coordination sphere of Fe(II) ion exchanged into zeolites is
thought to contain three oxygen ligands (ref. 24), our energetics
suggest the frequencies assigned to a dinitrosyl are instead due to
the kinetically accessible and thermodynamically favored dinitrogen
dioxide moiety. Further, it should be noted that dinitrosyl
stretching frequencies as high as 1900 cm^-1 are rare. In
conclusion we suggest that the kinetically (and thermodynamically)
most difficult step in (17) is the isomerization of the
dinitrogen dioxide complex 2 (or 5) to the cis hyponitrite
complex 3 (or 6).

III. Nitrogen fixation. 

The fixation of dinitrogen is a reductive process of both
biological and large scale industrial interest. Thermodynamically
the conversion of dinitrogen to ammonia is straightforward and
the conversion to hydrazine is feasible under high pressures
(ΔG°298 for these processes are -7.9 kcal/mol and +22.0 kcal/mol
respectively; if the pressure is increased to 100 atm then the
ΔG298 for hydrazine formation is +16.7 kcal/mol).

In the known nitrogen-fixing organisms the catalytic reduction
of dinitrogen is carried out by molybdoenzymes known as
nitrogenases (ref. 25). These nitrogen-fixing enzymes consist of two
protein components, a Fe-Mo protein and a Fe protein. Further, an
iron-molybdenum cofactor has been isolated from the Fe-Mo component
protein of nitrogenase. In fact extracts of the Mo-Fe component
from inactive mutant strains of microorganisms are activated by
addition of this cofactor. Two models of the active site have
been proposed that are consistent with Mossbauer and EPR
spectroscopic data and EXAFS analysis of the Fe-Mo cofactor.
Unfortunately the models of such active sites synthesized to date
do not reduce dinitrogen (refs. 28-30).

Industrially, dinitrogen reduction occurs over an iron catalyst
at high temperatures and pressures. The rate determining
step (ref. 31) is either the dissociative chemisorption of dinitrogen

    2 * + N2 → 2 N-*    (19)

or the simple chemisorption of an activated form of dinitrogen

    * + N2 → N2-*    (20)

Both of these processes are likely followed by rapid reaction
with hydrogen (either molecular hydrogen or chemisorbed atomic
hydrogen).

Thus, for both biological and industrial nitrification the
activation of dinitrogen is a prerequisite for reaction with
reductants such as hydrogen. Until very recently the observed
forms of dinitrogen were bound to the metal with the
nitrogen-nitrogen multiple bond largely intact (non-activated).

    M-N≡N-M  (end-on)      M-(N≡N)  (side-on)      (21)
                  7

Thus these model compounds will only reduce dinitrogen under
rather harsh conditions (ref. 32).

An understanding of a recently observed dinitrogen binding mode
(analogous to organic azines)

    M=N-N=M    (22)
       8

will provide additional insight into biological and industrial
nitrification. The reactivity and structural characteristics of
a new class of tantalum complexes suggest the bonding pattern
8 in (22). The Ta-N bond distances of 1.796 Å and 1.840 Å are
quite similar to those observed in normal tantalum imido
complexes (1.765 Å to 1.77 Å). In addition, reactions (23) and
(24) are both observed (reactions characteristic of metal-ligand
multiple bonding).

    M=N-R + R'2C=O → R'2C=NR + 'M=O'    (23)

    M=N-N=M + 2 R2C=O → R2C=N-N=CR2 + 2 'M=O'    (24)

Finally, there is an observable 'activation' of the
nitrogen-nitrogen bond (N-N bond distances of 1.282 Å and 1.298 Å
compared to free dinitrogen, which has a N-N bond distance of
1.0976 Å).

In this section we report energetic support for the kinetic
and thermodynamic accessibility of 8 for molybdenum complexes.
Our model is a bimetallic complex consisting of two
molybdenum tetrachloride units bridged by a dinitrogen molecule.
For this complex we have characterized the 'reaction path'
connecting the two likely resonance structures 7 and 8

    Cl4Mo-N≡N-MoCl4  ↔  Cl4Mo=N-N=MoCl4    (25)
           9                   10

We find local minima characteristic of each resonance structure,
indicating the 'resonance' interaction between these two forms is
not enough to result in a single averaged structure (ref. 34).
However, the resonance interaction is sufficient to provide a very
low barrier interconnecting them (less than 1 kcal/mol).
Thermodynamically we find 9 to be 20 kcal/mol more stable than 10
for the tetrachloride ligand backbone. This thermodynamic
difference could easily be overcome by an alteration of the ligand
backbone and future studies will concentrate on this. Geometrically,
for 9 the Mo-N distance is 2.28 Å and the N-N distance is 1.10 Å and
for 10 the Mo-N distance is 1.82 Å and the N-N distance 1.23 Å.
This is in accord with a suggestion that the tetrachloride
backbone does not fully activate the dinitrogen (a fully
activated N-N distance should be on the order of 1.30 Å).

Table I. Comparison of Scalar and Vector Matrix Transformations
(for various sized matrices, times in sec.)

Matrix   Scalar (with Opt.)        Vector (times x 100)               Ratio   MFLOPS
size     First   Second  Total     Expand  First   Second  Total     (S/V)   (vec.)
(NxN)    Mult.   Mult.   Time      Array   Mult.   Mult.   Time

 50      0.041   0.083    0.124    0.063    0.78    0.51     1.36      9.1    27.8
100      0.32    0.65     0.96     0.23     3.65    2.59     6.48     14.8    46.5
150      1.07    2.58     3.64     0.51     9.34    6.91    16.76     21.7    60.5
200      2.52    6.74     9.25     1.01    19.34   14.32    34.67     26.7    69.3
250      5.39   14.35    19.74     1.83    33.43   25.64    60.90     32.4    77.1
300      9.90   27.14    37.03     2.92    53.22   42.23   109.84     33.7    82.4


Table II. Timing Breakdown for MC-SCF Energy Generation.
(times in seconds)

                                    Molecule / No. of basis functions
Step                          H2O/7   FeCl2·(H2O)2/43   FeCl2(NO)2/65   FeCl2(NO)2(H2O)2/79

Calculate One electron
Integrals                     0.13        36.4              48.5              81.0
Calculate Two electron
Integrals                     1.06        86.6             191.7             535.5
Sort Two Electron
Integrals                     0.05        14.7              94.3             247.7
Generate Extended Hückel
Starting Guess                 —           0.8               1.1               —
Obtain Hartree Fock
Energy (10 it.)               0.11         1.8               3.1               —
Obtain MC-SCF
Energy (10 it.)                —            —               72.5             137.5
Total Time                    1.35       140.3             411.2            1001.7

% of Time
HF                            8.1          1.3               0.8               —
MC-SCF                         —            —               17.5              13.7

Table III. SCF Timing Breakdown for an Individual Cycle
(Times in seconds, rates in MFLOPS)

Wavefunction             Generate J and K    Transform J and K   OCBSE    Orbital     Optimize          Total
Description              Matrices            Matrices                     Rotations   a_ij and b_ij
                         Time      Rate      Time                Time     Time        Time              Time

H2O MBS HF               0.0001     4.6      0.006               0.004     —           —                 0.011
FeCl2·(H2O)2 HF          0.0082    49.0      0.017               0.078     —           —                 0.177
FeCl2(NO)2 HF            0.0310    60.6      0.034               0.241     —           —                 0.306
FeCl2(NO)2 GVB(12/24)    2.012     81.4      2.832               1.990    0.328       0.091              7.253
FeCl2(NO)2(H2O)2
GVB(12/24)               4.302     88.2      5.322               3.515    0.516       0.090             13.745

Abstract

The supersonic flow field over a body of revolution
incident to the free stream is simulated numerically on
a large array processor (the CDC Cyber 205). The
configuration is composed of a cone-cylinder forebody
followed by a conical afterbody from which emanates
a centered, supersonic propulsive jet. The free-stream
Mach number is 2, the jet-exit Mach number is 2.5, and
the jet-to-free-stream static pressure ratio is 3. Both
the external flow and the exhaust are ideal air at a
common total temperature. The thin-layer approximation
to the time-dependent, compressible, Reynolds-averaged
Navier-Stokes equations is solved using an
implicit finite-difference algorithm. The data base, of
5 million words, is structured in a "pencil" format so
that efficient use of the array processor can be realized.
The computer code is completely vectorized to take
advantage of the data structure. Turbulence closure
is achieved using an empirical algebraic eddy-viscosity
model. The configuration and flow conditions correspond
to published experimental tests and the computed
solutions are consistent with the experimental
data.

Introduction 

In 1980, a computational study was described in
which the three-dimensional flow field over axisymmetric
boattailed bodies at moderate angles of attack
was simulated (Ref. 1). The exhaust plumes were modeled by
solid plume simulators, and a second-order-accurate,
implicit finite-difference algorithm was used to solve
the governing partial differential equations on the
ILLIAC IV array processor. Several flow fields were
computed and the results compared with published
experimental data. The promising results of that first
study provided the incentive to extend the work to
include propulsive exhaust jets emanating from the
afterbody base. The ILLIAC IV was subsequently
removed from service, however, and it became necessary
to scale down the size and scope of the study to
the capacity of existing computer resources.

In January 1983, the results of a study of supersonic
axisymmetric flow over boattails containing a
centered propulsive jet were presented (Ref. 2). Those results,
obtained using a Cray 1S computer with 10^6 words
of main memory, were compared with existing
experimental data. Jet-to-free-stream static pressure
ratio and nozzle exit angle were varied parametrically,
and the predicted trends agreed well with experiment.

The purpose of this paper is to describe the
vectorized implementation of the three-dimensional
Navier-Stokes code on a Cyber 205 computer for
boattailed afterbodies at moderate angles of attack that
contain a centered propulsive jet. Some computed
results, which correspond in part to a published
experimental study for a like configuration and flow
conditions, are included for illustration.

Afterbody Configuration 

The geometric configuration is a 9 caliber body of
revolution composed of a 14° half-angle conical nose,
a cylindrical forebody, and an 8° half-angle conical
afterbody of 1 caliber length. Centered inside the
afterbody is a conical nozzle with exit diameter of 0.6
caliber that is flush with the afterbody base. The
nozzle exit half-angle is 20°.

Experimental studies for the same configuration
were performed by White and Agrell (Ref. 3) for the model
immersed in an air stream flowing at $M_\infty = 2.0$ and
a jet-exit Mach number of 2.5. White and Agrell
considered angles of incidence to the free stream up to 8°
and jet-to-free-stream static-pressure ratios up to 15.
Because of limited access to the Cyber 205 computer,
computed results are included in this paper only for
the case in which the angle of incidence is 6° and the
jet-to-free-stream pressure ratio is 3.0.

Governing Equations 

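The equations describing the flow are the
Reynolds-averaged Navier-Stokes equations. These are
written below in strong conservative form in generalized
coordinates as

    $\partial_t Q + \partial_\xi (F\cdot\vec e^{\,\xi}) + \partial_\eta (F\cdot\vec e^{\,\eta}) + \partial_\zeta (F\cdot\vec e^{\,\zeta}) = 0$    (1)

where

    $Q = (\rho,\ \rho u,\ \rho v,\ \rho w,\ e)^T$

    $F = (\rho\vec q,\ \rho u\vec q + \tau\cdot\vec e_x,\ \rho v\vec q + \tau\cdot\vec e_y,\ \rho w\vec q + \tau\cdot\vec e_z,\ e\vec q + \tau\cdot\vec q - \kappa\nabla T)^T$

and $\vec e_x$, $\vec e_y$, and $\vec e_z$ are the Cartesian unit vectors and
$\vec e^{\,\xi}$, $\vec e^{\,\eta}$, and $\vec e^{\,\zeta}$ are the contravariant base vectors,
which can be written as

    $\vec e^{\,\xi} = \xi_x \vec e_x + \xi_y \vec e_y + \xi_z \vec e_z$
    $\vec e^{\,\eta} = \eta_x \vec e_x + \eta_y \vec e_y + \eta_z \vec e_z$
    $\vec e^{\,\zeta} = \zeta_x \vec e_x + \zeta_y \vec e_y + \zeta_z \vec e_z$

The components of momentum, $\rho u$, $\rho v$, and $\rho w$,
are in Cartesian space and the velocity vector $\vec q$
is generally expressed in terms of the contravariant
velocity components, U, V, and W as

    $\vec q = u\vec e_x + v\vec e_y + w\vec e_z = U\vec e_\xi + V\vec e_\eta + W\vec e_\zeta$

where $\vec e_\xi$, $\vec e_\eta$, and $\vec e_\zeta$ are the covariant base vectors
written as

    $\vec e_\xi = x_\xi \vec e_x + y_\xi \vec e_y + z_\xi \vec e_z$
    $\vec e_\eta = x_\eta \vec e_x + y_\eta \vec e_y + z_\eta \vec e_z$
    $\vec e_\zeta = x_\zeta \vec e_x + y_\zeta \vec e_y + z_\zeta \vec e_z$

The Jacobian J of the transformation is given by

    $J^{-1} = x_\xi y_\eta z_\zeta + x_\eta y_\zeta z_\xi + x_\zeta y_\xi z_\eta - x_\xi y_\zeta z_\eta - x_\eta y_\xi z_\zeta - x_\zeta y_\eta z_\xi$

The flux vector F can be decomposed into a
parabolic part, $F_P$, which contains only gradient
diffusive terms, and a hyperbolic part, $F_H$, which
contains only convective-like terms, as

    $F_H = (\rho\vec q,\ \rho u\vec q + p\vec e_x,\ \rho v\vec q + p\vec e_y,\ \rho w\vec q + p\vec e_z,\ (e + p)\vec q)^T$,    $F_P = F - F_H$    (2)

For flows in which the shear layers are thin (when $Re \gg 1$)
and aligned with one principal plane (say the
plane normal to the $\eta$ coordinate), the parabolic part
of F can be neglected in the other two coordinates ($\xi$
and $\zeta$), without any real loss in accuracy. This is
consistent with boundary-layer theory and yet maintains
the coupling between the viscous and inviscid regions
that is critical in simulating interactive flows. With
this thin-layer approximation, Eq. (1) is rewritten as:

    $\partial_t Q + \partial_\xi (F_H\cdot\vec e^{\,\xi}) + \partial_\eta (F\cdot\vec e^{\,\eta}) + \partial_\zeta (F_H\cdot\vec e^{\,\zeta}) = 0$    (3)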

Computational Grid 

A body-oriented computational grid is constructed
in a manner compatible with the thin-layer approximation.
Shown in Fig. 1 is the grid used in the
present computations. Figure 1a shows the complete
configuration and Fig. 1b the detail in the base region
of the afterbody. Radial grid lines on the forebody join
the surface orthogonally. On the afterbody and in the
exhaust plume, the radial lines are normal to the body
axis. There are 81 points distributed along the body,
with clustering near the nose and near the base. Of the
81 points, 21 are used to define the afterbody shape;
the afterbody is 1 caliber long. An additional 59 points
are distributed downstream of the afterbody to a
distance equal to 21 forebody diameters from the nozzle
base. These 140 total points define the $\xi$ coordinate
distribution. The radial distribution, corresponding to
the $\eta$ coordinate, extends from the body surface to a
distance equal to 30 forebody diameters both ahead
of the nose and normal to the body axis. A total of
60 points is used in this region, with a high degree of
stretching used in order to resolve the sublayer of the
turbulent boundary layer. (Here the first grid point off
the body surface corresponds approximately to a value
of $\eta^+$ of 8, where $\eta^+ = (\rho_w \tau_w)^{1/2} (\eta - \eta_w)/\mu_w$.) An
additional 40 points are distributed across the nozzle
and its blunt base, extending from the centerline to
the body surface. Of these, 20 are in the jet exit plane
and 20 are on the blunt base itself.

Fig. 1 Computational grid: bilateral plane of symmetry.
a) Complete configuration (140 x 100 x 20);
b) Base-region detail.

One- and two-parameter hyperbolic-tangent stretching
functions (Ref. 4) are used in the base region to focus
resolution near the corners and to achieve a smooth,
piecewise continuous distribution of points across the
exhaust plume and base. At the nozzle exit, points are
distributed along an arc describing the conical
flow exit plane (that is, the arc radius is
equal to the nozzle exit radius of 0.3 caliber
divided by the sine of the nozzle-exit half-angle
of 20°). Downstream of the nozzle, the
grid lines are aligned so as to closely approximate
the exhaust plume shape for an experimentally
observed axisymmetric flow by Agrell
and White (Ref. 3), which is for the same geometric
configuration and free-stream conditions, but for a
jet-to-free-stream pressure ratio of 9. The
third dimension, $\zeta$, is generated by rotating the
two-dimensional ($\xi$, $\eta$) grid about the cylindrical
axis while maintaining a uniform angular distribution
between the rotated planes. Here,
20 radial planes are used, with planes 2 and
19 coinciding with the bilateral plane of symmetry,
where plane 2 corresponds to the lee and
plane 19 to the windward. Planes 1 and 20
are image planes used to enforce a symmetry
boundary condition. Thus, there are ($\xi$, $\eta$)
planes distributed every 10.588° around the half-body.

The total grid dimensions are (140 x 100 x 20),
corresponding to the $\xi$, $\eta$, and $\zeta$ directions, respectively,
for a total of 280,000 points. Of these, (80 x 40 x 20),
or 64,000, lie inside the body and are not used in the
computation, leaving an actual total of 216,000 points
used in the computation.
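The grid bookkeeping quoted above can be verified with a few lines of arithmetic. The Python sketch below simply recomputes the figures in the text; the function name `grid_summary` and the dictionary keys are hypothetical labels chosen for this illustration.

```python
import math


def grid_summary():
    """Recompute the grid counts and angles quoted in the text."""
    xi_points = 81 + 59            # along the body plus downstream of it
    eta_points = 60 + 40           # outer region plus nozzle and blunt base
    zeta_planes = 20               # includes the two image planes
    total = xi_points * eta_points * zeta_planes
    inside_body = 80 * 40 * 20     # points that lie inside the body
    return {
        "xi_points": xi_points,
        "eta_points": eta_points,
        "total_points": total,
        "used_points": total - inside_body,
        # Planes 2 through 19 span the 180 degree half-body in 17 intervals.
        "plane_spacing_deg": 180.0 / (19 - 2),
        # Arc radius at the nozzle exit: exit radius / sin(half-angle).
        "exit_arc_radius_caliber": 0.3 / math.sin(math.radians(20.0)),
    }
```

The 10.588° spacing follows from the 17 intervals between planes 2 and 19, and the exit arc radius works out to roughly 0.877 caliber.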

Data Structure 

There are 23 variables required at each grid point,
corresponding to the 5 conserved quantities in the Q
vector, 5 residuals for the solution vector, 9 metric
coefficients, the Jacobian of the transformation, and
3 components of vorticity used in the turbulence
transport model. This results, for a computational grid
of 216,000 points, in a data base of 5 x 10^6 words.

To accommodate this large data base on a vector
processor with a limited main memory, the computational
grid is divided into subsets called "blocks." This
data structure was originally devised for implementation
on the ILLIAC IV array processor by Lomax
and Pulliam and is described in detail in Ref. 6. In
the present case, each block is a 20 x 20 x 20 cube
for a total of 8,000 points and a data base subset of
184,000 words for the 23 variables. The blocks are
stacked together in each coordinate direction to form
a sequence of blocks called "pencils."

For a given coordinate direction, one complete
pencil of data is loaded into the central memory, and
computations are performed on that data corresponding
to the coordinate direction. At any point in the
computation, only 17 variables are required to be in the
main memory at one time (6 of the 9 metric coefficients
are not used in any given direction). This results in
a data-base subset of 136,000 words per block. For a
processor with 10^6 words of main memory then, as many
as seven blocks of data can be held in storage for
immediate processing. The block dimension is an
adjustable parameter and is limited only by the maximum
pencil length and the main memory of the vector
processor.
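The storage figures in this section follow directly from the block size and the variable counts. A small Python sketch makes the dependence explicit; the function name `memory_budget` and its parameter names are hypothetical, and the defaults are the values quoted in the text.

```python
def memory_budget(points=216_000, block_edge=20, n_vars=23,
                  n_resident=17, main_memory=1_000_000):
    """Words of storage implied by the block/pencil data structure."""
    block_points = block_edge ** 3
    resident_words = block_points * n_resident
    return {
        "data_base_words": points * n_vars,           # about 5 x 10^6
        "blocks_total": points // block_points,
        "block_words": block_points * n_vars,
        "resident_block_words": resident_words,
        "blocks_in_memory": main_memory // resident_words,
    }
```

Changing `block_edge` shows the trade-off mentioned at the end of the paragraph: smaller blocks allow more of them in memory at once, but each vector operation then runs over fewer points.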

Shown in Fig. 2, in physical coordinates, are the
block boundaries for the present configuration. Figure
2a shows the complete configuration and Fig. 2b the
detail in the afterbody region. Figure 3 shows the
corresponding block structure in computational space.
The mesh nodes of the computational domain are arranged
in a rectangular lattice with positive integer
coordinates ($\xi$, $\eta$, $\zeta$). Each node belongs to three
pencils, a $\xi$-pencil, an $\eta$-pencil, and a $\zeta$-pencil. The pencils
of each sweep direction are given a definite order. For
the $\xi$-pencils, the $\eta$-coordinate varies most rapidly as
the pencil index increases; for both the $\eta$-pencils and
$\zeta$-pencils the coordinate $\xi$ varies most rapidly. Figure
4 illustrates this sequencing for the present data
structure.
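The pencil ordering rule described above can be written down as a short Python sketch. This is an illustration only: the function name `pencil_order` is hypothetical, pencils are identified by the block coordinates that stay fixed along the sweep, and the blocks that fall inside the body are ignored for simplicity.

```python
def pencil_order(direction, nb_xi, nb_eta, nb_zeta):
    """List pencils in sweep order, as tuples of the fixed block coordinates.

    direction "xi":   tuples are (eta_block, zeta_block), eta varies fastest.
    direction "eta":  tuples are (xi_block, zeta_block),  xi varies fastest.
    direction "zeta": tuples are (xi_block, eta_block),   xi varies fastest.
    """
    if direction == "xi":
        return [(e, z) for z in range(nb_zeta) for e in range(nb_eta)]
    if direction == "eta":
        return [(x, z) for z in range(nb_zeta) for x in range(nb_xi)]
    if direction == "zeta":
        return [(x, e) for e in range(nb_eta) for x in range(nb_xi)]
    raise ValueError("direction must be 'xi', 'eta', or 'zeta'")
```

With 20 x 20 x 20 blocks the 140 x 100 x 20 grid corresponds to 7 x 5 x 1 blocks, so `pencil_order("zeta", 7, 5, 1)` lists 35 single-block pencils with the xi block index advancing first.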

Within a pencil, the planes are naturally ordered
by the sweep coordinate. The pencils of data can be
stored in the correct pencil ordering for just one sweep
direction only. When sweeping in the other directions,
pencils of data are gathered and fetched for computation
and scattered back when writing the updated
values. Additionally, the ordering of nodes within a
plane can be correct for just one sweep direction, and
it is necessary to transpose the data in memory
so that each plane of nodes normal to the sweep direction
forms a contiguous set of memory locations. In
the present code, the ordering of nodes is correct for
the $\xi$-direction and transpose routines are used for the
other sweep directions.

Fig. 2 Block boundaries: physical space. a) Complete
configuration; b) Base-region detail.

Fig. 3 Block boundaries: computational space, complete
configuration.

Fig. 4 Data structure within pencil data base.

Numerical Algorithm

The numerical algorithm used to solve Eq. (3)
is the approximate factored scheme of Beam and
Warming. Rewriting Eq. (3) as

    $\partial_t Q = -\partial_\xi (F_H\cdot\vec e^{\,\xi}) - \partial_\eta (F\cdot\vec e^{\,\eta}) - \partial_\zeta (F_H\cdot\vec e^{\,\zeta}) = R$    (4)

the corresponding difference equation is then

    $L_\eta L_\xi L_\zeta \Delta_t Q = R_\xi + R_\eta + R_\zeta$    (5)

where the operators are defined by 

Lf = (/ H" At 6% A n — €[J 

L, =* (J + Ai5,C n - e t - AtS v 

L< * (/ + At 6 ( B n - €[ F l V ( A f J) 

£< - - A tS^JFfi ■ f) n - £E r l (v f A tfj Q n 
= -A tS^JF ■ FT - es F 1 (V n A„) 2 J Q n 
Z< = - A tS,(JF H ■ FT - F l (V f A ,fJ Q n 

where the 6{, and <J f are central-difference 

operators; V lf and are backward-difference 
operators; and and A ? are forward-difference 

operators in the rj-, and ^--directions, respectively. 
The At term is a forward-difference operator in time. 
For example, 
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and 


& t Q~Q*+ l -Q* 

A«g - g(f + AC, n, c) - Q(Z, n, i) 


Begin Loop 


V& - «(f, 7, f) - Qi€ - AC n, f ) 
The Jacobian matrices 


A » 9q(Fh • ?*) 

B ** 3q(Fh • j*) 

C^doiFn-f) 

M^d^Fp-g*) 

are described in detail by Pulliam and Steger.* Fourth- 
order explicit terms (preceded by the coefficient eg) 
and second-order implicit terms (preceded by the 
coefficient (/) have been added to control nonlinear in- 
stabilities. 

Equation: (5) is solved in three successive sweeps 
of the data base, each sweep inverting one of the 
operators on the left-hand side: 


^.pencils: 

Read: Q, J, R, w, ^-metrics 

Transpose: Q, J, R, u 
Compote: R » Zf + J Z v 

v * «(£) + "(f) 

Transpose: R, « 

Write: R, u 

rj-pencils: 

Read: Q, J, R, cj, r;- metrics 

Transpose: Q, J, R, w 
Compute: u — w(€) + w(f ) + w(? 7 ) 

Mt(w) 

r— *< + *, + *, 

i.;*(R) 

Transpose: jL‘ 1 (R) 

Write: t; l (R) 


- c; 1 (**+*,+*,) 

L< A t Q - r/r , 1 ( + *„ + z t ) 
a «q = r e l r f l r; ( + *„ + *, ) 

The solution is advanced in time by adding. A*43 to Q 
after the { sweep. 

In the general case, pencils of data are loaded into 
central memory four times* and operated on for each 
time-step advance: once each for the £ and rj direc- 
tions and twice for the f direction. First the right- 
hand side of Eq. (5) is formed and then the left- 
hand-side operators are inverted one by one. A flow 
schematic showing the ordering of operations, includ- 
ing data reads, transposes, computations, and data 
writes is shown beiow where the symbols R and u 
represent variables used to accumulate the right-hand- 
side elements and vorticity elements, respectively, for 
each coordinate direction. 

{-pencils: (initial step only) 

Read: Q, J , {-metrics 

Compute: R = c j a o^£) 

Write: R, u 


f-pencils; 

Read: Q, J, L~ n l (R), f-metrics 

Transpose; Q, J , ^‘(R) 

Compute: R) 

Transpose: i; l L;‘(R) 

Write: 

{-pencils: 

Read; Q, J , LJ 1 X.^(R), £-metrics 

Compute: & t Q t <3, R = Z^ <j = oj(£) 

Write: Q, R, t u 

End Loop 

In this flow sequence, 62 variables are read, 57 variables are
transposed, and 31 variables are written. For the special case in
the present study in which the ζ-pencils are just one block long, a
more efficient operation sequence can be used that substantially
reduces the number of reads and writes required. This is shown
below.

ξ-pencils: (initial step only)

    Read:       Q, J, ξ-metrics
    Compute:    R = R_ξ, ω = ω(ξ)

Begin Loop

ζ-pencils:

    Read:       ζ-metrics
    Transpose:  Q, J, R, ω
    Compute:    R = R_ξ + R_ζ
                ω = ω(ξ) + ω(ζ)
    Transpose:  R, ω
    Write:      R, ω

η-pencils:

    Read:       Q, J, R, ω, η-metrics
    Transpose:  Q, J, R, ω
    Compute:    ω = ω(ξ) + ω(ζ) + ω(η)
                μ_T(ω)
                R = R_ξ + R_η + R_ζ
                L_η^{-1}(R)

ζ-pencils:

    Read:       ζ-metrics
    Transpose:  Q, J, L_η^{-1}(R)
    Compute:    L_ζ^{-1} L_η^{-1}(R)
    Transpose:  L_ζ^{-1} L_η^{-1}(R)
    Write:      L_ζ^{-1} L_η^{-1}(R)

ξ-pencils:

    Read:       Q, J, L_ζ^{-1} L_η^{-1}(R), ξ-metrics
    Compute:    Δ_t Q, Q, R = R_ξ, ω = ω(ξ)
    Write:      Q

End Loop

In this flow sequence, 32 variables are read, 52 are transposed,
and 18 variables are written, a savings of nearly 50% in the I/O.
In both the general case and the special case, the data
read-transpose sequence and the transpose-write sequence can be
replaced by the more efficient "gather" and "scatter" commands
available for the Cyber 205 (Ref. 9). Further improvements in
efficiency can be obtained by using asynchronous I/O in conjunction
with a rotating memory backing store. The most efficient code,
however, will be realized by using a solid-state backing store in
conjunction with gather and scatter commands or with a code that is
fully core contained.
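
The variable counts quoted for the general flow sequence can be checked
with a short tally. The component counts used below are assumptions, not
values stated in the text: Q and R are taken to have five components each,
J one, the vorticity three, and each set of metrics three. With these
assumed sizes, the general-case loop reproduces the quoted totals of 62
reads, 57 transposes, and 31 writes.

```python
# Assumed component counts (hypothetical; chosen because they reproduce
# the totals quoted in the text for the general case).
SIZE = {"Q": 5, "J": 1, "R": 5, "w": 3, "metrics": 3, "LR": 5}

# One loop pass of the general case:
# (sweep, read, transposed before compute, transposed after compute, written)
# "LR" stands for R after one or more of the L operators have been inverted.
GENERAL = [
    ("zeta", ["Q", "J", "R", "w", "metrics"], ["Q", "J", "R", "w"], ["R", "w"], ["R", "w"]),
    ("eta",  ["Q", "J", "R", "w", "metrics"], ["Q", "J", "R", "w"], ["LR"], ["LR"]),
    ("zeta", ["Q", "J", "LR", "metrics"], ["Q", "J", "LR"], ["LR"], ["LR"]),
    ("xi",   ["Q", "J", "LR", "metrics"], [], [], ["Q", "R", "w"]),
]

def count(names):
    """Number of scalar variables in a list of named quantities."""
    return sum(SIZE[name] for name in names)

def tally(flow):
    """Return (reads, transposes, writes) for one pass through the loop."""
    reads = sum(count(read) for _, read, _, _, _ in flow)
    transposes = sum(count(t_in) + count(t_out) for _, _, t_in, t_out, _ in flow)
    writes = sum(count(written) for _, _, _, _, written in flow)
    return reads, transposes, writes

print(tally(GENERAL))
```

The same table-driven tally could be applied to the special-case sequence
once its read and write lists are fixed.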

The numerical algorithm conforms well to large vectorization. For
block sizes of 20 x 20 x 20, the vector length is 400. Timing
studies with the present code indicate an MFLOP rate (millions of
floating-point operations per second) of 115 when computing in half
precision (32-bit word lengths) on a 2-pipe configuration. On a
4-pipe configuration the MFLOP rate increased to 207. There are
approximately 3,800 floating-point operations executed for every
grid node per time step, resulting in a CPU time of 33 x 10^-6 sec
per point per time-step on a 2-pipe machine and 18 x 10^-6 sec per
point per time-step on a 4-pipe machine. The transpose times
(transposes do not contain any floating-point operations) are
5.6 x 10^-6 sec per point. Equivalent transposes performed by gather
and scatter instructions require just 1.8 x 10^-6 sec per point.
When synchronized I/O to and from rotating backing store was used,
the average I/O time was 25 msec per variable per block. This
translates directly into 172 x 10^-6 sec per point, but overlapping
the I/O reduces this to 94 x 10^-6 sec per point. (The Cyber 205
used for these timing studies was configured with four I/O channels
to accommodate overlapping.) This time, a result in large part of
the latency time in accessing disk files, can be reduced to nearly
zero by using I/O buffers in conjunction with asynchronous I/O or
with solid-state backing storage. The use of I/O buffers, however,
implies the availability of additional main memory and imposes an
additional constraint on the pencil size. To avoid this constraint,
the data flow should be modified such that a subset of contiguous
blocks of data in a pencil are operated on while blocks at each end
of the subset are being buffered in and out.
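
The timing figures above are linked by simple arithmetic, which the
following sketch reproduces. It uses only numbers quoted in the text; the
variable names are illustrative. Because the per-point CPU times are
rounded, the 4-pipe rate computed this way comes out near 211 rather than
exactly the quoted 207. By the same arithmetic, the synchronous I/O time
per point corresponds to roughly 55 variable transfers per node per step.

```python
FLOPS_PER_POINT = 3800            # floating-point operations per node per step
BLOCK_POINTS = 20 * 20 * 20       # nodes in one block
PLANE_POINTS = 20 * 20            # nodes in one plane, i.e. the vector length

def mflops(cpu_seconds_per_point):
    """MFLOP rate implied by a CPU time per point per time step."""
    return FLOPS_PER_POINT / cpu_seconds_per_point / 1.0e6

rate_2pipe = mflops(33e-6)
rate_4pipe = mflops(18e-6)

# Synchronous I/O: 25 msec per variable per block, spread over a block's nodes.
io_per_variable_per_point = 25e-3 / BLOCK_POINTS

# Number of variable transfers implied by the quoted 172 x 10^-6 sec per point.
implied_transfers = 172e-6 / io_per_variable_per_point

print(PLANE_POINTS, round(rate_2pipe), round(rate_4pipe))
print(io_per_variable_per_point, round(implied_transfers))
```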

Boundary Conditions 

Boundary conditions are imposed at the ends of each data pencil;
the data pencils are identified by number in Fig. 3. For the
ξ-direction, pencil No. 1 starts at the jet-exit plane. Supersonic
conical flow conditions corresponding to a jet-exit Mach number of
2.5 and a static pressure of 3p∞ are imposed at the first data
plane. At the last plane of each of the five ξ-pencils, which
correspond to the outflow boundary, first-order extrapolation is
used so that ∂_ξ Q = 0. Pencil No. 2 in the ξ-direction begins at
the blunt base. Here slip conditions and an impermeable adiabatic
wall are imposed so that

$$\partial_\xi(\rho) = \partial_\xi(\rho v) = \partial_\xi(\rho w) = 0$$

$$\rho u = 0$$

$$\partial_\xi [\, e - 0.5(\rho u^2 + \rho v^2 + \rho w^2) \,] = 0$$

Pencils 3, 4, and 5 in the ξ-direction begin on the grid centerline
of revolution (at ξ = 0) ahead of the forebody nose. Here a
second-order extrapolation to the centerline is used such that

$$\partial_\xi(\rho) = \partial_\xi(\rho u) = \partial_\xi(\rho w) = \partial_\xi(e) = 0$$

while the lateral momentum is set to zero

$$\rho v = 0$$

In addition, at each η, the Q values are averaged over ζ on the
centerline and used as boundary values for all ζ at each η. Special
treatment of the base corner at the afterbody-blunt-base junction
is used to account for the singular nature of that line. For the
ξ-sweeps, the ζ-line of data in pencil No. 3 that corresponds to
this corner is treated in the same manner as the first plane of
data in pencil No. 2 that corresponds to the blunt base. This line
of data is treated differently in the η-sweep and is described in
the second paragraph following.

After the forebody flow field is fully developed during the course
of the solution, the first two η-pencils can be dropped from the
computation and boundary conditions imposed on the ξ-pencils that
correspond to the fully developed flow at the plane that is the
upstream boundary of η-pencil No. 3. This reduces the total data
base by six blocks without altering the validity of the solution.
This simplification is strictly valid only for supersonic external
flows. The solution downstream can be further developed to steady
state, and jet parameters can even be varied to generate additional
solutions.

Boundary conditions for the η-direction consist of the imposition
of free-stream conditions at the last plane of each of the seven
η-pencils; no-slip, adiabatic wall condition for the first plane of
η-pencils 1 through 4, which correspond to the body surface; and
first-order extrapolation to the centerline for pencils 5, 6, and 7
such that ∂_η Q = 0. Centerline averaging, as described for the
ξ-pencil boundary ahead of the body, is also used for the η-pencil
boundary in the jet. The line of data in η-pencil No. 5, which
corresponds to the corner between the afterbody and the blunt base,
is treated in the same manner as the first plane of η-pencils 1
through 4. As a result, this line of data is double valued: one
value for the ξ sweep described previously and the no-slip,
adiabatic value for the η-sweep.

For the ζ-direction, bilateral symmetry is imposed by setting the
data at the first and last ζ-planes equal to the values in the
third plane and in the second from last plane, respectively, with a
sign change included in the lateral momentum component (ρv).

Turbulence Closure 

The Reynolds stresses and turbulent heat-flux terms have been
included in the stress tensor and heat-flux vector by using the
eddy-viscosity and eddy-conductivity concept, whereby the
coefficients of viscosity and thermal conductivity are the sum of
the molecular (laminar) part and an eddy (turbulent) part.
Eddy-viscosity models incorporate turbulent transport into the
molecular-transport stress tensor by adding the scalar
eddy-viscosity transport coefficient to the coefficient of
molecular viscosity (μ = μ_M + μ_T), thereby relating turbulent
transport directly to gradients of the mean-flow variables. In a
Cartesian coordinate system, the three-dimensional molecular stress
tensor can be written as

+ (P + VjfiyZy + 

“h (p “b 

In the thin-shear-layer approximation, the only components of the
stress tensor that are retained are those having gradients with
respect to η only.

Turbulent heat transport is defined in terms of mean-energy
gradients and an eddy-conductivity coefficient κ_T, such that
κ = κ_M + κ_T. Typically, the eddy-conductivity coefficient is
related to the eddy-viscosity coefficient via a turbulent Prandtl
number Pr_T, where

$$Pr_T = c_p \mu_T / \kappa_T$$

The turbulent Prandtl number is assumed constant at a value of 0.9.

The algebraic eddy-viscosity model used here is that proposed by
Baldwin and Lomax (Ref. 20). This model is particularly well suited
to complex flows that contain regions in which the length scales
are not clearly defined. It is described briefly as follows: For
wall-bounded shear layers, a two-layer formulation is used such
that

$$\mu_T = (\mu_T)_{inner} \quad \text{for } \eta < \eta_{crossover}$$

$$\mu_T = (\mu_T)_{outer} \quad \text{for } \eta \ge \eta_{crossover}$$

where η is the normal distance from the wall and η_crossover is the
smallest value of η at which values from the inner and outer
formulas are equal. The Prandtl-Van Driest formulation is used in
the inner (or wall) region.

$$(\mu_T)_{inner} = \rho \, \ell^2 \, |\omega|$$

$$\ell = 0.4 \, \eta \, [1 - \exp(-\eta / A)]$$

$$A = 26 \, \mu_w / \sqrt{\rho_w \tau_w}$$

The formulation for the outer region is given by

$$(\mu_T)_{outer} = 0.0168 \, C_{cp} \, \rho \, F_{wake} \, F_{kleb}(\eta)$$

The quantities η_max and F_max are determined from the function

$$F(\eta) = \eta \, |\omega| \, [1 - \exp(-\eta / A)]$$

where F_max is the maximum value of F(η), and η_max is the value of
η at which it occurs. The function F_kleb(η) is the Klebanoff
intermittency function given by

$$F_{kleb}(\eta) = [1 + 5.5 \, (C_{kleb} \, \eta / \eta_{max})^6]^{-1}$$

The quantity q_dif is the difference between the maximum and
minimum total squared velocity in the profile (along an
η-coordinate line), and for boundary layers, the minimum is defined
as zero. The other constants are given by

$$C_{cp} = 1.6, \quad C_{wk} = 0.25, \quad C_{kleb} = 0.3$$

The advantages of this model for boundary-layer flows are as
follows: 1) for the inner region, the velocity and length scales
are always well defined, and the model is consistent with the "law
of the wall"; 2) in the outer region for well-behaved (simple)
boundary layers, where there is a well-defined length scale
(η_max), the velocity scale is determined by F_max, which is a
length scale times a vorticity scale; 3) in the outer region of
complex boundary layers where the length from a wall becomes
meaningless, a new length scale is determined from a velocity
(q_dif) divided by a velocity gradient (|ω|), and the velocity
scale is q_dif.

The outer formulation, which is independent of η, is also used in
the free-shear flow regions of separated flow and in regions of
strong viscous/inviscid interaction. In these regions the van
Driest damping term, exp(−η/A), is neglected. For jets and wakes,
the Klebanoff intermittency factor is determined by measuring from
the grid centerline, and the minimum term in q_dif is evaluated
from the profile instead of being defined as zero.
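
The two-layer model can be sketched along a single wall-normal line as
follows. This is a minimal illustration, not the code used for the
results reported here. The inner-layer formula, F(η), the Klebanoff
function, and the constants follow the text. The wake function F_wake is
taken from the standard published Baldwin-Lomax form, with q_dif treated
as a speed difference; both are assumptions, as are the function and
argument names.

```python
import math

K_VON_KARMAN = 0.4     # mixing-length constant in the inner layer
A_PLUS = 26.0          # Van Driest damping constant
K_CLAUSER = 0.0168
C_CP = 1.6
C_WK = 0.25
C_KLEB = 0.3

def baldwin_lomax(eta, rho, vort, speed, mu_w, rho_w, tau_w):
    """Eddy viscosity along one wall-normal line; eta[0] is the wall."""
    a_len = A_PLUS * mu_w / math.sqrt(rho_w * tau_w)   # damping length A
    damp = [1.0 - math.exp(-y / a_len) for y in eta]

    # Inner layer: mu_T = rho * l^2 * |omega|, with l = 0.4 * eta * damping
    inner = [r * (K_VON_KARMAN * y * d) ** 2 * w
             for y, r, w, d in zip(eta, rho, vort, damp)]

    # F(eta) = eta * |omega| * damping, and the location of its maximum
    f = [y * w * d for y, w, d in zip(eta, vort, damp)]
    j_max = max(range(len(f)), key=f.__getitem__)
    f_max, eta_max = f[j_max], eta[j_max]
    if f_max <= 0.0:
        return [0.0] * len(eta)        # no vorticity: no eddy viscosity

    # ASSUMPTION: standard Baldwin-Lomax wake function; for a boundary
    # layer the minimum speed is taken as zero, so q_dif = max speed.
    q_dif = max(speed)
    f_wake = min(eta_max * f_max, C_WK * eta_max * q_dif ** 2 / f_max)

    outer = []
    for y, r in zip(eta, rho):
        f_kleb = 1.0 / (1.0 + 5.5 * (C_KLEB * y / eta_max) ** 6)
        outer.append(K_CLAUSER * C_CP * r * f_wake * f_kleb)

    # Use the inner value up to the first crossover, the outer value beyond it
    mu_t, crossed = [], False
    for m_in, m_out in zip(inner, outer):
        crossed = crossed or m_in >= m_out
        mu_t.append(m_out if crossed else m_in)
    return mu_t

# Illustrative profile: vorticity decreasing linearly away from the wall
eta = [0.01 * i for i in range(11)]
vort = [100.0 * (1.0 - y / 0.1) for y in eta]
speed = [100.0 * y - 500.0 * y * y for y in eta]
rho = [1.0] * len(eta)
mu_t = baldwin_lomax(eta, rho, vort, speed, mu_w=1.0e-5, rho_w=1.0, tau_w=1.0e-3)
print(mu_t)
```

In this made-up profile the eddy viscosity is zero at the wall, follows
the inner formula for the first interior point, and switches to the outer
formula from the crossover onward.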

The validity of the eddy-viscosity model constants 
for high-pressure, compressible exhaust jets has not 
been established, and compressibility effects are not 
accounted for. 

At the exhaust-jet exit plane and in the near-base region, the eddy
viscosity is assumed to be negligibly small and to increase
spatially to the value given by the outer model over a short
distance downstream of the base.

Computed Results 

As mentioned in a preceding section (Afterbody Configuration), a
flow field has been computed for the body placed at an angle of
incidence of 6° to a free stream at Mach 2. The jet-exit Mach
number is 2.5 with a static pressure 3 times that of the free
stream. Beginning with an impulsive start in a uniformly flowing
stream at Mach 2, the solution was advanced timewise to a
dimensionless time (t U∞/d) of 5.1, where d is the forebody
diameter and U∞ is the undisturbed free-stream speed. Although a
solution at a time of 5.1 is probably not sufficiently converged to
permit valid quantitative comparisons with experiment, it is
sufficient to establish the basic flow-field character and to
illustrate the features of the solution and the computer code.

The initial time-step size of Δt = 0.0001 was increased to
Δt = 0.001 as the solution passed through its initial rapid
transient. A variable time-step was used in the subsonic flow
regime downstream of the base in order to minimize the growth of
nonlinear instabilities aggravated by changes in sign of the
eigenvalues in this region. The time-steps in this subsonic region
were scaled down by a factor equal to the local streamwise Mach
number with a cutoff minimum factor of 0.001 imposed to prevent the
time-step from going to zero.
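
The time-step scaling just described can be written in a few lines. This
is a sketch of the rule as stated, with two assumptions that are not in
the text: the absolute value of the Mach number is used so that reverse
flow is handled, and supersonic points keep the full step. The function
name is hypothetical.

```python
def local_time_step(dt, mach_streamwise, min_factor=0.001):
    """Scale dt by the local streamwise Mach number in subsonic regions."""
    mach = abs(mach_streamwise)        # assumption: reverse flow uses |M|
    if mach >= 1.0:
        return dt                      # supersonic: full time step
    return dt * max(mach, min_factor)  # subsonic: scaled, with a floor

for mach in (2.0, 0.5, 0.0):
    print(mach, local_time_step(0.001, mach))
```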

Occurring physically in this region is a rapid expansion of the jet
around the nozzle lip followed immediately by a strong
recompression in the form of a barrel shock; in addition there is a
slip surface defining the boundary between the exhaust plume and
the external flow. Each of these three high-gradient features is
focused at the nozzle lip and demands a high degree of resolution
that has not been provided for in the computational grid used here.

Shown in Fig. 5 are computed density contours in the bilateral
plane of symmetry in the vicinity of the body. The lower surface is
the wind side. Clearly defined downstream of the afterbody is the
slip surface demarcating the boundary between the exhaust plume and
the external flow. The propulsive jet expands rapidly around the
nozzle lip and can induce flow separation on the afterbody surface.
For low-pressure jets, or no jet at all, there will be a region of
recirculating flow on the blunt base. The afterbody drag is
strongly influenced by the detail of the separated flow.



Fig. 5 Computed density contours, plane of symmetry:
M∞ = 2, Mj = 2.5, pj/p∞ = 3, α = 6°, Re = 1.5 x 10^6.

Fig. 6 Afterbody flow detail: surface streamlines and density
contours on bilateral plane of symmetry.


The detail of the separation pattern is shown in Fig. 6, in which
computed surface streamlines have been mapped on the afterbody and
projected on the bilateral plane-of-symmetry view of the density
contour plot over the aft portion of the body only. There is a
separation node on the lee generator of the conical afterbody at
x = 8.92. All surface streamlines on the lee side of the body flow
into this node. A line of separation extends from this node,
downward on the afterbody surface, to a separation saddle at
x = 8.98, 33° from the wind generator. The flow direction along
this line of separation is upward from the saddle to the node.
There is also flow outward from the separation saddle downward to
the end of the base, around to the wind generator.


Shown in Fig. 7 is a perspective view of the surface streamlines on
the afterbody and the blunt base. The outer edge of the base is a
dividing surface streamline extending from a saddle point on the
lee generator to a node point approximately 33° from the wind
generator. A dividing streamline can be seen circumscribing the
annular base connecting a saddle point on the windward and a nodal
point on the lee. This line separates the external flow from the
flow from the jet. Flow is upward from the windward saddle to the
lee-side node.

Fig. 7 Perspective view of surface streamlines over conical
afterbody and annular base.

Shown in Fig. 8 is a sketch of an end-view projection of the full
view of the afterbody (not to scale) showing all the dividing
streamlines and their corresponding singular points and flow
directions.

The trajectories of the fluid particles in the plane of symmetry in
the base region are shown in Fig. 9. On the lee, seen in Fig. 9a,
the fluid from the jet expands around the nozzle lip and moves
outward toward the edge of the base. Upon meeting the external
flow, it turns downstream and defines the exhaust plume boundary. A
region of reverse flow can be clearly seen above the afterbody lee
generator. The path of the fluid in the external flow is over this
separation region and around the afterbody base to the slip surface
defining the boundary between the exhaust plume and external flow.
The point defined by the outer edge of the base and the afterbody
lee generator is a singular point that, from the fluid streamlines,
appears as a saddle point in both the circumferential plane and in
the radial plane, and as a nodal point in the streamwise bilateral
plane of symmetry (the plane of the base).

On the windward, shown in Fig. 9b, the streamlines just off the
wind generator of the afterbody turn the corner and move toward the
slip surface between the jet and the external flow. All external
flow streamlines (excluding the surface streamline) approach the
slip surface downstream of a saddle point in the bilateral plane of
symmetry located at x = 9.016 on the plume-external flow boundary.
The surface streamline turns the corner and approaches the windward
saddle point on the base itself. Fluid from the jet expands around
the nozzle lip and moves outward. The fluid just off the lip moves
to the saddle point on the base and the fluid farther inside the
lip expands toward the plume boundary downstream of the saddle
point on the slip surface.

Surface-pressure distributions over the afterbody surface and over
the base are shown in Figs. 10a and 10b, respectively. An expansion
at the forebody-afterbody junction over the afterbody surface can
be seen. This expansion is greatest on the windward, where the
pressure level is highest, and decreases toward the lee. The
circumferential variation of pressure near the lee side is quite
small for the entire length of the afterbody. Toward the end of the
afterbody there is a slight recompression on the lee side which is
not observed on the windward. Just at the end of the afterbody
there is an expansion as the flow turns around the afterbody toward
the base.

Figure 10b shows a projected view of the base and jet-exit pressure
distribution. The left side of the "top hat" pressure distribution
corresponds to the lee, and the far side corresponds to the
windward. The large uniform pressure distribution of the "top hat"
configuration corresponds to the high-pressure jet, and the
undulating "brim" of the hat is the distribution on the base. On
the windward there is a rapid expansion at the nozzle lip followed
by a fairly large recompression toward the outer edge of the base.
The same trend is observed at other radial positions around the
base but to a lesser degree. The circumferential variation of base
pressure is consistent with the experimentally observed variation
of White and Agrell for the same jet-to-free-stream pressure ratio.
It is interesting to note, however, that in most experimental
studies the radial variation of pressure is assumed negligible and
is not measured. The distribution in Fig. 10b clearly indicates a
substantial variation across the base.

Fig. 9 Base-region path lines: plane of symmetry. a) Lee; b) Windward.


Fig. 10 Surface pressure distributions, perspective. a) Conical
afterbody; b) Annular base and jet exit plane.



Concluding Remarks 

An implicit solution procedure for the thin-layer approximation to
the three-dimensional, time-dependent, compressible,
Reynolds-averaged Navier-Stokes equations on a large array
processor has been described. An example problem was simulated on
the Cyber 205 computer that required a data base of 5 x 10^6 words.
The efficient treatment of this large data base has been described
in some detail.

The flow field simulated was the supersonic flow over a body of
revolution at incidence to the free stream. A propulsive jet
emanated from the boattailed afterbody, inducing a complex,
three-dimensional separated-flow pattern. This separated flow
field, which contributes substantially to the afterbody drag, has
been described in detail for the particular geometry and flow
conditions considered. The computed solution is consistent with
experimental data observed for the same configuration and flow
conditions.

ABSTRACT


Viscous flow past a circular cylinder becomes unstable around 
Reynolds number Re = 40. With a numerical technique based on Newton's 
method and made possible by the use of a supercomputer, steady (but 
unstable) solutions have been calculated up to Re = 400. It is found 
that the wake continues to grow in length approximately linearly with 
Re. However, in conflict with available asymptotic predictions, the 
width starts to increase very rapidly around Re = 500. All numerical 
calculations have been performed on the CDC Cyber 205 at the CDC 
Service Center in Arden Hills, Minnesota. 


INTRODUCTION 

The structure of viscous steady flow past a circular cylinder at
high Reynolds numbers forms one of the classical problems in fluid
mechanics. In spite of much attention, several fundamental questions
remain open. Apart from a previous calculation by the present author
[6], complete, steady flow fields have been obtained numerically only
up to around Re = 100. This is also close to the upper limit for
experiments (due to temporal instabilities). Both the early numerics
and the experiments point to a recirculation region growing linearly
in length with Re. Figure 1 shows the length of the wake bubble
against Reynolds number according to some different calculations.
Persistence of this growth for Re → ∞ has been assumed in most
recent asymptotic studies of steady high Reynolds number flows past a
body (e.g. F.T. Smith [13]). A possible Euler flow, consistent with
this idea, was analyzed by Brodetsky [3] in 1923. It is known as the
Helmholtz-Kirchhoff free streamline model. This suggested limit is
characterized by two vortex sheets leaving the body tangentially
approximately 55° from the upwind center line and extending to
downstream infinity, enclosing a region of stagnant flow. Although
this undoubtedly is a solution for Re = ∞, G.K. Batchelor [2] gave
in 1956 several arguments against this being a possible limit for
Re → ∞. He proposed an alternative in which a finite wake with
piecewise constant vorticity was bounded by vortex sheets. Some
suggestions about how such a flow might be reached as a limit for
increasing Reynolds number have been given by Peregrine [10].
However, only very few Euler solutions of this so-called
Prandtl-Batchelor type have been calculated (e.g. [12] contains one
example and some further references). None of these are for flow past
a cylinder. Figure 2 gives an 'artist's impression' of what the two
models for infinite Re might look like. The calculation [6] hinted
at a process leading to a shortening of the wake. The present work
suggests (in agreement with F.T. Smith [14]) this shortening at
Re = 300 was erroneous and caused by insufficient numerical
resolution. However, our best current evidence is that the
qualitative result was correct. We believe that a reversal of trends
towards a shorter wake can be expected around Re = 500. This
contrasts with the conclusions in [14]. Our main evidence is that
the wake increases in width far more rapidly after Re = 300 than the
asymptotic analysis allows for. Independently of the position of
artificial boundaries and of numerical resolution, we find that the
flow is of different character past Re = 300. Significant amounts of
vorticity are then re-circulated back into the wake bubble from its
end. We hope to soon carry this study past Re = 400.

All the numerical calculations in this present work were 
performed on the Control Data Corporation Cyber 205 computer located 
at the CDC Service Center in Arden Hills, Minnesota. We wish to 
express our gratitude to Control Data Corporation for making this 
system available for this work. 





MATHEMATICAL FORMULATION. 


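With a cylinder of radius 1 and a Reynolds number based on the
diameter, the governing time-independent Navier-Stokes equations,
expressed in streamfunction ψ and vorticity ω, take the form:

(1)   $\Delta \psi + \omega = 0$

(2)   $\Delta \omega + \dfrac{Re}{2} \left\{ \dfrac{\partial \psi}{\partial x} \dfrac{\partial \omega}{\partial y} - \dfrac{\partial \psi}{\partial y} \dfrac{\partial \omega}{\partial x} \right\} = 0$
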

Accurate numerical approximation and economical computational
solution of these equations in the given geometry pose a series of
difficulties which previous investigators have dealt with in a
variety of ways. The most serious of the difficulties seem to be:

1. Boundary conditions for ψ at large distances.

2. Boundary condition for ω at the body surface.

3. Avoiding the loss of accuracy that comes with upwind
   differencing.

4. Economical choice of computational grid.

5. Reliable and fast rate of convergence of numerical
   iterations.

Point 5 above has been the limiting factor in virtually all
previous attempts to reach high Reynolds numbers. No reliable
technique has emerged to prevent slowly converging iteration schemes
from picking up physical instabilities in the artificial time of the
iterations.

NUMERICAL METHOD


All vorticity is concentrated on the body surface and in a quite
thin wake downstream of the body. Outside this region we can use the
much simpler equations:

(3)   $\Delta \psi = 0$

(4)   $\omega = 0$

The top part of Figure 3 shows the upper half plane minus a unit
circle and, dotted, a region which contains all the vorticity (apart
from the far wake). The bottom part of the figure shows how the
mapping $z = \sqrt{x} + 1/\sqrt{x}$ maps these to the first quadrant
and a rectangle respectively. Figure 4 shows what a rectangular grid
in the z-plane (with non-uniform stretching in the vertical direction)
can look like in the x-plane. The Navier-Stokes equations,
transformed to the z-plane (z = ξ + iη), take a form almost identical
to (1) and (2):

(5)   $\Delta \psi + \omega / J = 0$

(6)   $\Delta \omega + \dfrac{Re}{2} \left\{ \dfrac{\partial \psi}{\partial \xi} \dfrac{\partial \omega}{\partial \eta} - \dfrac{\partial \psi}{\partial \eta} \dfrac{\partial \omega}{\partial \xi} \right\} = 0$

where $J = |dz/dx|^2$ is the Jacobian of the mapping. These equations
were modified further by subtracting out potential flow. The stream
function for the difference is $\Psi = \psi - 2\xi\eta$. On a grid in
the (stretched) z-plane, equations (5) and (6) were approximated at
all interior points with centered second order finite differences. To
close the system, boundary conditions have to be implemented for ψ
and ω at all boundaries.
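
Two properties of the mapping can be checked numerically: the upper half
of the unit circle lands on the real segment between 0 and 2 in the
z-plane, and the potential-flow stream function Im(x + 1/x) becomes 2ξη.
The sketch below assumes the mapping has the form z = √x + 1/√x with the
principal branch of the square root; the function name and sample points
are illustrative.

```python
import cmath
import math

def to_z(x):
    """Map a point x of the physical plane to z = sqrt(x) + 1/sqrt(x)."""
    s = cmath.sqrt(x)          # principal branch: upper half plane -> first quadrant
    return s + 1.0 / s

# Points on the upper half of the unit circle (the body surface)
surface = [to_z(cmath.exp(1j * t))
           for t in (0.25 * math.pi, 0.5 * math.pi, 0.75 * math.pi)]

# A sample point in the flow field
x = complex(-1.5, 2.0)
z = to_z(x)
potential_physical = (x + 1.0 / x).imag    # stream function of potential flow
potential_mapped = 2.0 * z.real * z.imag   # 2 * xi * eta in the z-plane

for point in surface:
    print(point)
print(potential_physical, potential_mapped)
```

The second check follows from z² = x + 2 + 1/x, so the imaginary parts of
z² and x + 1/x agree.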


The extreme sensitivity of the final solution to small errors in
these conditions has only recently been fully recognized [6]. For
example, already at Re = 2 it was found that use of the free stream
value for ψ along circular outer boundaries at distances 23.1 and
91.5 caused 18 % and 4.4 % errors in the level of vorticity on the
body surface.





The 'Oseen' approximation is the leading term in an asymptotic
expansion for the flow far out in a wake (e.g. Imai [8]). In polar
coordinates, it takes the form

(7)   $\psi = \dfrac{C_D}{2} \left( \dfrac{\theta}{\pi} - \operatorname{erf} Q \right)$

(8)   $\omega = -\dfrac{C_D \, Re \, Q}{4 \sqrt{\pi} \, r} \, e^{-Q^2}$

where $Q = (\tfrac{1}{2} Re \, r)^{1/2} \sin \tfrac{1}{2}\theta$,
$\operatorname{erf} Q = 2 \pi^{-1/2} \int_0^Q e^{-s^2} ds$ and $C_D$ the
drag coefficient. $C_D$ can be evaluated as a line integral around the
body.

The performance of this Oseen condition as an outer boundary
condition is disappointing. The percentage errors mentioned above
improve, but only to 3.4 % and 1.2 % respectively. For increasing
Re, direct use of (7) becomes meaningless. Figure 5 illustrates this
by comparing the true ψ (here the difference between streamfunction
and free stream, not potential flow) with the values from (7) at
Re = 200. The two fields bear no resemblance to each other at the
distances from the body we are interested in.
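
The Oseen far-field expressions (7) and (8) are straightforward to
evaluate. The sketch below assumes the forms ψ = (C_D/2)(θ/π − erf Q) and
ω = −C_D Re Q exp(−Q²)/(4√π r) with Q = (Re r/2)^(1/2) sin(θ/2). The sign
of ω is chosen to be consistent with equation (1) and should be read as an
assumption; the function name and the sample drag coefficient are
illustrative only.

```python
import math

def oseen_far_field(r, theta, reynolds, c_d):
    """Return (Q, psi, omega) of the assumed Oseen far-field form."""
    q = math.sqrt(0.5 * reynolds * r) * math.sin(0.5 * theta)
    psi = 0.5 * c_d * (theta / math.pi - math.erf(q))
    omega = (-c_d * reynolds * q * math.exp(-q * q)
             / (4.0 * math.sqrt(math.pi) * r))
    return q, psi, omega

# Vorticity is confined to a narrow region around the outflow axis (theta = 0)
for theta in (0.0, 0.02, 0.1, 0.5):
    q, psi, omega = oseen_far_field(r=50.0, theta=theta, reynolds=200.0, c_d=1.0)
    print(theta, q, psi, omega)
```

The exponential factor exp(−Q²) makes ω negligible away from the outflow
axis, in line with the remark that errors in (8) are confined to a very
narrow region.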

Comparison with numerics suggests that (8) is far more accurate
than (7). Furthermore

1. Any errors in (8) are present only in a very narrow region
   along the outflow axis, not along the whole upper boundary
   as with (7).

2. The governing equation for ω is of a type which cannot
   transport incorrect information for ω back up towards the
   cylinder.

With this background, let us briefly outline how the boundary
conditions of high accuracy can be implemented on the edges of the
present computational region. Figure 6 shows this region in the
z-plane with a typical vorticity field together with its reflections
in the coordinate axes.

BOUNDARY CONDITIONS FOR ψ.


Left boundary: 5= 0 ,0 <f)< 9»j* 
Bottom boundary: "Q = 0 ,0 < j~< ^ . 
Right boundary: J=S M >0 SOSIv’ 

Top boundary: v ]= 


H>- o. 


4 >= 0 . 


5^1 = fco (noting that << 

<< along this boundary). 

J Jwln((Vjf + (V^ )/j d S d 3 


A correction to the integral above for vorticity reaching
outside the downstream boundary can easily be incorporated. For a
fixed grid, the dependence of ψ at each boundary point on ω at
each internal point is independent of Re and can be calculated as a
large matrix once and for all. A boundary condition of this kind was
used in all the calculations presented below. However, we currently
use a different condition. A wide two-level difference formula can be
found which is consistent only with the decaying modes of the
equation Δψ = 0 (as opposed to the usual 5-point 3-level formula
used inside the region to approximate both growing and decaying
modes).


BOUNDARY CONDITIONS FOR ω.


Left boundary: \= 0 ,0 
Bottom boundary: 0 ,0 < £< 2 . 

2 < 5 <? m . 

Right boundary: ,0 


<*> = 0. 

A relation based on 4'^+ w /] e = 0 
and an even function of rj . 

10 = 0 . 


Top boundary: < ? < § M . w = 0. 


The condition at the right boundary comes from the observation 
that the leading term of (8), transformed to (ξ, η) coordinates, 
simplifies to 

(9)     [equation illegible in the source] 

where c_1 and c_2 are constants. The mapping has achieved a 
separation of variables. 
The discrete approximations at the interior points together with 
the boundary conditions form, after minor simplifications (explicitly 
eliminating all boundary unknowns apart from ψ at the top 
boundary), a non-linear algebraic system of (M-2)(2N-3) equations 
with equally many unknowns. In most earlier works, great care has 
been taken to ensure that, at this stage, this (or some equivalent) 
non-linear system has a diagonally dominant form for low Re. This 
would allow direct functional iteration to convergence. Techniques 
like upwind differencing [1], [4], [11] help in this respect at the 
cost of lowered accuracy. Newton's method, described below, offers an 
outstanding alternative. 


NEWTON'S METHOD. 


Newton's method is a very well known procedure for finding zeros 
of scalar functions. If a function f(x) is given, we can find an x 
such that f(x) = 0 by the procedure: 

(10)    x_0 = 'close' guess of root 

(11)    x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)},    n = 0, 1, 2, ... 

The iteration step can be written 

(12)    f'(x_n) \, \Delta x_n = -f(x_n) 

where the three quantities are: 

f'(x_n): known, f' evaluated at the latest available approximation x_n. 

\Delta x_n: unknown, the correction we should apply to x_n, i.e. x_{n+1} = x_n + \Delta x_n. 

-f(x_n): known, the residual, which would be zero if x_n had been exact. 

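The correction form (12) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `newton_scalar`, the tolerance and the test function x^2 - 2 are assumptions made for the example and do not come from the paper.

```python
def newton_scalar(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate f'(x_n) * dx_n = -f(x_n), then x_{n+1} = x_n + dx_n."""
    x = x0
    for _ in range(max_iter):
        residual = f(x)                  # known: zero if x were exact
        if abs(residual) < tol:
            return x
        dx = -residual / fprime(x)       # unknown correction, from eq. (12)
        x = x + dx
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical test problem: the positive root of x^2 - 2 = 0.
root = newton_scalar(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Writing the step as "solve for the correction, then update" is what carries over unchanged to systems, where the division becomes a linear solve.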


Written in this form, the generalization to systems is 
straightforward. For example the system with three equations in 
three unknowns: 
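(13)    f(x, y, z) = 0 
        g(x, y, z) = 0 
        h(x, y, z) = 0 

can be iterated 

\[
\begin{bmatrix}
\partial f/\partial x & \partial f/\partial y & \partial f/\partial z \\
\partial g/\partial x & \partial g/\partial y & \partial g/\partial z \\
\partial h/\partial x & \partial h/\partial y & \partial h/\partial z
\end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \\ \Delta z \end{bmatrix}
= -
\begin{bmatrix} f(x_n, y_n, z_n) \\ g(x_n, y_n, z_n) \\ h(x_n, y_n, z_n) \end{bmatrix}
\]

The matrix on the left is known (the "Jacobian" of the system), the vector 
of corrections is unknown, and the right-hand side is the known residual. 
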


Each iteration involves the solution of a linear system. As in the 
scalar case, convergence is quadratic and guaranteed to occur for 
approximations sufficiently close to any 'simple' solution. The 
realization that this procedure is practical for extremely large 
systems (several thousands of equations) is rather recent and linked 
to the emergence of powerful computers. 
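A minimal sketch of Newton's method for a small system follows. It is not the authors' solver: the dense Gaussian elimination, the function names and the three-equation test problem are all assumptions chosen only to show the "Jacobian times correction equals minus residual" step.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def newton_system(residual, jacobian, guess, tol=1e-12, max_iter=25):
    """Repeat: solve J(v) * dv = -F(v), then update v <- v + dv."""
    v = list(guess)
    for _ in range(max_iter):
        r = residual(v)
        if max(abs(ri) for ri in r) < tol:
            return v
        correction = solve_linear(jacobian(v), [-ri for ri in r])
        v = [vi + di for vi, di in zip(v, correction)]
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical test system with a simple solution at (1, 1, 1).
def residual(v):
    x, y, z = v
    return [x * x + y - 2.0, y * y + z - 2.0, z * z + x - 2.0]


def jacobian(v):
    x, y, z = v
    return [[2.0 * x, 1.0, 0.0],
            [0.0, 2.0 * y, 1.0],
            [1.0, 0.0, 2.0 * z]]


solution = newton_system(residual, jacobian, [1.2, 0.8, 1.1])
```

For the flow problem the Jacobian is large and banded, so the dense solve used here would be replaced by a structured factorization.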

For our present problem, use of Newton's method offers several 
major advantages: 

1. The quadratic convergence allows no possibility of 'inheriting' 
temporal instabilities to the artificial time of the iterations. 
Convergence is guaranteed if an isolated solution exists in the 
neighborhood of a guess. 

2. If turning points or bifurcation points are found, they will 
cause no difficulties. 

3. No upwind differencing is needed. This procedure is typically 
employed for two reasons: 

1. To ensure convergence of an iterative method. 

2. To avoid mesh size oscillations. 

The first reason no longer applies. The second one alone can 
then be addressed in more refined ways. 

4. Boundary conditions at the body surface become easier to 
implement. The fact that we have two conditions on ψ and none 
on ω can cause a problem if (5) and (6) are treated separately. 
With Newton's method, all we need is that the number of 
conditions is right. 

The only disadvantage with Newton's method is the computational 
cost. This is where supercomputers enter the picture. 


SOLUTION OF LINEAR SYSTEM 

Let ψ_j, j = 2, 3, ..., N, be vectors with ψ-values from grid lines 
2, 3, ..., N, and similarly for ω_j (j = 2, ..., N-1). For example, ψ_2 would 
contain the ψ-values along the grid row nearest to the ξ-axis and ψ_N 
the values along the top boundary. The structure of the entries in 
the Jacobian matrix directly reflects the difference stencils and 
the boundary conditions. Figure 7 shows a suitable ordering of 
equations and unknowns and the corresponding structure of the 
Jacobian. Since the top right corner contains a single diagonal, 
explicit multiples of the top (N-2)(M-2) equations can be superposed 
on the equations below to modify the structure to the one in Figure 
8. The bottom left corner forms a separated system of size (N-1)(M-2). 
This system was solved by a border algorithm similar to the one 
described in [9]. The major cost comes from the LU-factorization of 
A. However, one more rearrangement can be done to achieve a 
significant speedup. The A-matrix has a block 5-diagonal form with 
the structure shown in Figure 8. A similarity transform with a 
permutation matrix can rearrange this into another matrix of 
identical structure. Instead of N-2 rows of blocks, each of size M-2, 
we get M-2 rows of blocks of size N-2. With M typically around 6*N 
and cost proportional to the square of the bandwidth, this reduces 
the memory needed for the LU-decomposition by a factor of about 6 and 
the operation count by a factor of about 36. 
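The factors quoted for the reordering can be checked with a rough cost model. The sketch below assumes that band storage scales as (number of unknowns) x (bandwidth), that operation count scales as (number of unknowns) x (bandwidth squared), and that the bandwidth is proportional to the block size; these proportionalities and the function name are assumptions for illustration, not figures from the paper.

```python
def banded_lu_cost(n_unknowns, bandwidth):
    """Relative storage and operation count of a banded LU factorization."""
    storage = n_unknowns * bandwidth
    operations = n_unknowns * bandwidth ** 2
    return storage, operations


def reordering_gain(n_rows, m_cols):
    """Gain from using blocks of size N-2 instead of blocks of size M-2."""
    unknowns = (n_rows - 2) * (m_cols - 2)
    storage_before, ops_before = banded_lu_cost(unknowns, m_cols - 2)
    storage_after, ops_after = banded_lu_cost(unknowns, n_rows - 2)
    return storage_before / storage_after, ops_before / ops_after


# Grid used in the calculations of the paper: M = 131, N = 21.
memory_ratio, ops_ratio = reordering_gain(21, 131)
```

For the 131 by 21 grid this model gives ratios of roughly 6.8 in memory and 46 in operations, the same order as the "about 6" and "36" obtained from M = 6N.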

The complete linear solver lends itself ideally to 
vectorization. Every part of significant cost turns out to take the form 
of a 'linked triad' with vectors never shorter than 4(N-2)+1 or M. 
The linked triad on the Cyber 205 is the fastest floating point 
operation the machine offers. Expressions of the form 
vector-op-vector-op-scalar, where one 'op' is + or - and the other is *, can 
execute with both operations running simultaneously. On the 2-pipe 
205, the algorithm has a potential for 200 mflops (million floating 
point operations per second, 64-bit accuracy). Including a startup 
cost of 85 machine cycles per linked triad operation, an average vector 
length of around 166 (which we will exceed in later test cases) could 
give a full 100 mflop overall computational performance. In the 
calculations presented below, the grid had 131 by 21 points. 
Building up the Jacobian (in scalar mode) takes 2.3 seconds and the 
solution of the linear system 3.7 seconds (for an average of 55 
mflops during this part). Recently implemented vectorization of the 
Jacobian and the new boundary condition bring these numbers to 0.026 
seconds, 1.75 seconds and 60 mflops respectively. 
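The relation between startup cost, vector length and delivered rate can be illustrated with a simple timing model. The sketch assumes a 20 ns machine cycle and one linked-triad result (two floating point operations) per pipe per cycle, which reproduces the 200 mflop peak quoted above; the 85-cycle startup is taken from the text, while the model itself and its names are assumptions.

```python
CYCLE_SECONDS = 20e-9      # assumed Cyber 205 cycle time
PIPES = 2                  # 2-pipe machine
FLOPS_PER_ELEMENT = 2      # a linked triad does one multiply and one add
STARTUP_CYCLES = 85        # startup cost quoted in the text


def linked_triad_mflops(vector_length):
    """Delivered rate of one linked triad of the given vector length."""
    cycles = STARTUP_CYCLES + vector_length / PIPES
    flops = FLOPS_PER_ELEMENT * vector_length
    return flops / (cycles * CYCLE_SECONDS) / 1e6


rate_166 = linked_triad_mflops(166)
rate_170 = linked_triad_mflops(170)
rate_long = linked_triad_mflops(10 ** 6)
```

Under these assumptions half of the peak rate is reached at a vector length of 170, close to the figure of around 166 given above, and the rate approaches 200 mflops only for very long vectors.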


PHYSICAL CONCLUSIONS 


This report is a preliminary one of work in progress. Only a few 
initial test runs have been performed so far. However, we can already 
conclude that the wake appears to continue a linear growth in length 
with increasing Reynolds numbers up to Re = 400. Figure 9 shows wake 
length versus Re for some previous calculations compared with current 
results. Figure 10 shows streamlines and Figure 11 vorticity fields 
for different values of Re up to 400. The vorticity field at Re = 
400 shows a recirculation back into the wake from the end of the 
bubble as well as a quite sudden increase in width. Our most recent 
tests with a computational grid of 196 by 31 points (density 
increased by 3/2 in each direction) leave these features completely 
unchanged. The onset occurs near Re = 300 and the widening progresses 
at a rate which can be determined accurately and which far exceeds 
the one predicted by available asymptotic models. 

The flow fields in figures 10 and 11 were obtained from a 131*21 
grid in the z-plane with 
(14)    ξ_i = i/12,    i = 0, 1, ..., 130 

(15)    η_j = [stretching formula illegible in the source],    j = 0, 1, ..., 20, with a parameter value of 0.15 

This places the right boundary at a distance 115.4 from the center of 
the cylinder. Preliminary tests involving moving this and the top 
boundaries in and out suggest that they are sufficiently far out with 
the present choice of grid. Figure 4 showed part of this grid. 

The major open questions at the moment are: 

Physically: 

1. Will the wake keep on growing? 

2. Are there any other branches of solutions (bifurcations 
etc.)? 

Numerically: 

1. Is there any alternative to Newton's method which still 
possesses a reliable rate of convergence? 

2. Is there any faster way than Gaussian elimination to solve 
the linear system in Newton's method? 


At present, the numerical questions are wide open and of 
fundamental importance to many other applications as well. Current 
numerical methods together with vector computers like the Cyber 205 
probably form sufficiently powerful tools to settle conclusively the 
physical questions raised here. 
Figure 1. Length of wake bubble for low Reynolds numbers according to some different calculations. 

Figure 2. Schematic illustration of free streamline and the Prandtl-Batchelor models. 

Figure 5. True difference between streamfunction and free stream compared with 
Oseen approximation for Re = 200. 

Figure 6. Typical vortic... 

Figure 9. Length of wake bubble for different Reynolds numbers. 
ABSTRACT 

A computer code which solves the Navier-Stokes equations for three-dimensional, 
time-dependent, homogeneous turbulence has been written for the Cyber 205. The 
code has options for both 64-bit and 32-bit arithmetic. With 32-bit computation, 
mesh sizes up to 64^3 are contained within core of a 2 million 64-bit word memory. 
Computer speed timing runs were made for various vector lengths up to 6144. 
With this code, speeds a little over 100 Mflops have been achieved on a 2-pipe 
Cyber 205. Several problems encountered in the coding are discussed. 

1. INTRODUCTION 

Turbulent fluid motion is common to many branches of engineering and science. 
Since turbulence phenomena are highly nonlinear, they are not amenable to classi- 
cal analytical approaches. Consequently, turbulence predictions are generally 
based on semi-empirical models. Experiments which generate model information 
are expensive, but are needed because current models are not generally accurate 
enough for engineering purposes. Detailed simulations of turbulent flows can help 
complement laboratory data. Direct numerical simulations of turbulent flows are 
more accurate than current semi-empirical computational methods and can be 
used to both generate physical understanding and to improve the models. In these 
simulations, turbulent flows are directly computed from the Navier-Stokes equa- 
tions. Computations of this type are necessarily three-dimensional and time- 
dependent; they require a large number of grid points, and thus, long computation 
time. The Cyber 205 computer appears ideally suited for efficient numerical 
simulations of this type. Exploration of the use of the Cyber 205 for direct 
numerical simulation of turbulence is a principal objective of this work. 

The basic code was written by one of the authors (RSR). It was modified to take 
advantage of the 205 compiler’s automatic vectorizing capability. Vector syntax 
and special functions were applied to the code segments which could not be auto- 
matically vectorized. Finally, machine language instructions were used for the 
parts of the code that the existing compiler could not handle. 
In the next section, a description of the particular problem to be solved is given. 
In Section 3, the numerical methods used are discussed. This is followed by a brief 
description of the Cyber 205 at Colorado State University. The construction of 
long vectors is discussed in Section 5. In Section 6, performance data obtained to 
date are presented, and in Section 7, problems encountered are described. A typi- 
cal simulation of homogeneous isotropic turbulence is presented in Section 8. In 
the final section, a brief statement of conclusions is presented. 

2. PROBLEM STATEMENT 


Homogeneous turbulent flows, of which there is a considerable variety, can be 
simulated numerically at low Reynolds number without using any turbulence 
model. In the flows we will consider, the computational domain contains a fixed 
mass of fluid within a rectangular parallelepiped, the opposing sides of which can 
move inward or outward with time. Thus, the cases which can be computed are 
quite varied: decaying homogeneous isotropic turbulence is generated if all six 
sides are stationary; turbulence undergoing uniform compression (or expansion) if 
all three pairs of sides move inward (outward) at the same rate; turbulence undergoing 
one-dimensional compression if one pair of sides moves inward; or turbulence 
undergoing plane strain if one pair of sides moves inward at the same rate as a 
second pair moves outward, while the third pair remains stationary. Isotropic 
turbulence has been computed before, but turbulence undergoing compression or 
expansion has not. The compression cases are of interest, for example, in internal 
combustion engine modeling and in the interaction of turbulence with a shock 
wave. 


It will be assumed that the Mach number is sufficiently small that the fluid is 
compressed uniformly in space, so that the fluid density depends only on time. 


The governing Navier-Stokes equations for a fluid of uniform viscosity and 
uniform density in space are: 

[equations illegible in the source] 

where u_i, p, ν, and t are fluctuating velocity components, fluctuating pressure, 
kinematic viscosity and time respectively. The summation convention is implied. 
This set of governing Navier-Stokes equations allows us to simulate homogeneous 
turbulent flows in a Lagrangian coordinate system that moves with the mean flow. 
The coordinate transformation tensor B_ij is determined by: 

\[ \frac{dB_{ij}}{dt} + B_{ik} U_{k,j} = 0 \]

Note that the mean strain rate tensor, U_{i,j}, is zero and B_ij = δ_ij for isotropic 
homogeneous turbulence. 
Periodic boundary conditions are applied in all three space directions. The 
velocity field is initialized to an isotropic state that satisfies continuity and has a 
given energy spectrum which approximates that of experimental isotropic tur- 
bulence. 

3. NUMERICAL METHOD 

The spectral method is used to compute all spatial derivatives. This method, 
which uses FFTs, is good for problems with periodic boundary conditions and has 
very high accuracy. To avoid aliasing in the nonlinear terms, both the truncation 
and phase shifting techniques are used. 

A second order Runge-Kutta method is used to advance the solution in time. 
Thus, all spatial derivatives need to be computed twice each time step. The time 
step was chosen small enough that no significant error is produced. It was deter- 
mined by increasing the step size until the error was approximately 1 percent over 
the full integration period. 
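The statement that all derivatives are computed twice per time step follows from the two stages of a second order Runge-Kutta scheme. The sketch below uses the midpoint variant on a scalar test equation; the paper does not say which second order variant was used, so the variant, the test equation du/dt = -u and all names here are assumptions.

```python
import math


def rk2_step(rhs, t, u, dt):
    """One midpoint (second order Runge-Kutta) step: two rhs evaluations."""
    k1 = rhs(t, u)
    k2 = rhs(t + 0.5 * dt, u + 0.5 * dt * k1)
    return u + dt * k2


def integrate(rhs, u0, t_end, n_steps):
    dt = t_end / n_steps
    t, u = 0.0, u0
    for _ in range(n_steps):
        u = rk2_step(rhs, t, u, dt)
        t += dt
    return u


def counted(rhs):
    """Wrap rhs so that the number of evaluations can be inspected."""
    def wrapper(t, u):
        wrapper.calls += 1
        return rhs(t, u)
    wrapper.calls = 0
    return wrapper


decay = counted(lambda t, u: -u)          # hypothetical test problem
coarse = integrate(decay, 1.0, 1.0, 100)
fine = integrate(decay, 1.0, 1.0, 200)
exact = math.exp(-1.0)
error_ratio = abs(coarse - exact) / abs(fine - exact)
```

Halving the step reduces the error by a factor close to four, which is the behaviour one would rely on when enlarging the step until a target error is reached.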

4. THE CYBER 205 

The Cyber 205 we are using is the Colorado State machine with 2 pipes and a 2 
million 64-bit word fast memory. GTE Telenet has been used for data transfer 
between Stanford and CSU. We have found that both are reliable, convenient to 
use, and have provided satisfactory service so far. 

Figure 1 shows the performance for add/multiply as a function of vector length. 
The asymptotic performance, which requires the maximum vector length (65535), is 100 
Mflops for 64-bit arithmetic and 200 Mflops for 32-bit arithmetic. 

It is obvious that the performance improves with vector length. Vector length 
1000 (64-bit case) or 2000 (32-bit case) is required to reach 90 percent of the 
asymptotic performance. Constructing a code which uses long vectors is there- 
fore important if maximum performance from the machine is to be obtained. 
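The 90 percent figures can be related to a startup cost through the usual half-performance-length model. This model is an assumption introduced here for illustration (it is not stated in the paper); r_\infty is the asymptotic rate and n_{1/2} the vector length at which half of it is reached.

```latex
r(n) = r_\infty \,\frac{n}{n + n_{1/2}},
\qquad
\frac{r(n)}{r_\infty} = 0.9 \;\Longrightarrow\; n_{1/2} = \frac{n}{9}
% 64-bit case: n = 1000 gives n_{1/2} \approx 111
% 32-bit case: n = 2000 gives n_{1/2} \approx 222
```

Under this model the half-performance length doubles in 32-bit mode, which is what one would expect if the startup time is unchanged while elements are processed twice as fast.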

5. DATA MANAGEMENT 

Based on the "longer vector gives better performance" philosophy, we chose to do 
the Fourier transforms in parallel. This will be explained in detail later. 

In Figure 2, NX, NY, and NZ are the number of mesh points in the x, y, and z 
directions respectively; MY and MZ are called "pencil sizes". 

On the first sweep, MZ x-y planes of data are Fourier transformed in the y direc- 
tion in parallel. The transform length is NY, but by doing them in parallel, a 
vector length of NX/2*MZ*3 is achieved; the factor 3 is due to the simultaneous 
processing of three velocity components, and the factor 1/2 arises because only half of 
the modes are needed in wave space to represent a real function in physical space. 
To accomplish this, it is useful to lump every dependent variable into a single big 
array. The main array in our code is DATA(NX/2,NY,NZ,4,2); the dimensions 
represent x, y, z, a dependent variable index, and real and imaginary parts of a 
complex number. 


Figure 1. Vector add/multiply Mflop rate as a function of vector length. 




On the second sweep, MY x-z planes are processed. Fourier transforms in z and x 
directions are done on this sweep. The vector lengths are NX/2*MY*3 and 
NZ*MY*3 respectively. 
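The vector lengths of the two sweeps follow directly from the mesh and pencil sizes. The sketch below evaluates the three formulas for a 64^3 mesh, assuming that the "32x16" pencil quoted for that run means MY = 32 and MZ = 16; that reading and the function name are assumptions.

```python
def sweep_vector_lengths(nx, nz, my, mz, components=3):
    """Vector lengths for the y transform (first sweep) and the z and x
    transforms (second sweep), following the formulas in the text."""
    first_sweep_y = (nx // 2) * mz * components
    second_sweep_z = (nx // 2) * my * components
    second_sweep_x = nz * my * components
    return first_sweep_y, second_sweep_z, second_sweep_x


# 64^3 mesh with assumed pencil sizes MY = 32, MZ = 16.
lengths = sweep_vector_lengths(nx=64, nz=64, my=32, mz=16)
```

With these values the longest vector is 6144 elements, which matches the maximum vector length mentioned in the abstract.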

A Cyber 205 vector is defined as a contiguous set of memory locations. Since the 
two sweeps are in different directions, an array transpose has to be done between 
sweeps and within the second sweep in order to keep processed data in a 
contiguous set of memory locations. The transpose is done by using gather 
instructions. The gather instruction puts array elements which are at various 
locations into a contiguous set of memory locations. An index vector is needed to 
pick up desired elements. The Q8VGATHR function (64-bit) or the Q8VXTOV subroutine 
(32-bit) is used to do the transposing. As the array gets bigger, so does the index 
vector length, and an appreciable amount of overhead working space is needed. In 
the 64^3 (32x16) run, the index vector has 17,408 elements. 
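The idea of transposing by gathering through an index vector can be sketched in plain Python. This is only a stand-in for the hardware gather: the functions below are hypothetical helpers and do not reproduce the calling conventions of Q8VGATHR or Q8VXTOV.

```python
def transpose_index_vector(rows, cols):
    """Index vector for transposing a rows x cols array stored row by row.

    Entry k is the source position of the element that lands at
    position k of the transposed (cols x rows) array.
    """
    return [r * cols + c for c in range(cols) for r in range(rows)]


def gather(source, index):
    """Collect scattered elements into one contiguous list."""
    return [source[i] for i in index]


# A 2 x 3 array stored contiguously: [[0, 1, 2], [3, 4, 5]].
flat = [0, 1, 2, 3, 4, 5]
index = transpose_index_vector(2, 3)
transposed = gather(flat, index)   # 3 x 2 array: [[0, 3], [1, 4], [2, 5]]
```

The index vector has one entry per gathered element, which is why its length, and the working space it needs, grows with the array.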

6. COMPUTER PERFORMANCE 


The performance data obtained to date, based on a hand count of the number of 
operations per time step, are presented in Table 1. The mesh size is given in 
column 2 (each node requires 7 words of data storage). The pencil size is given in 
column 3; this, together with mesh size, determines the vector length shown in 
column 4. The computational precision is given in column 5, the CPU time in 
column 6, and the CPU computation rate in column 7. The I/O time per step in 
seconds is meaningful only for runs with virtual memory paging. Explicit I/O 
would reduce I/O time considerably, but we have not yet attempted to use explicit 
I/O. 


Figure 3 shows computation rate as a function of vector length for our code on the 
2-pipe CSU Cyber 205. It approaches an asymptote as vector length increases. 

Comparing Runs 3 and 4, and Runs 5 and 6 in Table 1, it is found that the CPU 
time for a 32-bit (half) precision run is 60 percent of that for the corresponding 
64-bit (full) precision run. We kept track of the timing in the transpose part of 
the code and found an interesting fact. In full precision runs, the transpose takes 
15 percent of the CPU time; 85 percent of the CPU time is spent in floating point 
operations. In half precision runs, due to the lack of a half precision gather 
utility, the transpose takes the same amount of time as in full precision runs, 
while the floating point operations require only half of the full precision CPU 
time. Consequently, for a half precision run, the transpose takes 25 percent of the 
total time. 
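The percentages above are mutually consistent, as a short calculation shows. Here T_full is the full precision CPU time, and the transpose and floating point parts are written T_tr and T_fl; the notation is introduced only for this check.

```latex
T_{\mathrm{tr}} = 0.15\,T_{\mathrm{full}}, \qquad
T_{\mathrm{fl}} = 0.85\,T_{\mathrm{full}}
\\[4pt]
T_{\mathrm{half}} = T_{\mathrm{tr}} + \tfrac{1}{2}\,T_{\mathrm{fl}}
 = (0.15 + 0.425)\,T_{\mathrm{full}} = 0.575\,T_{\mathrm{full}} \approx 0.6\,T_{\mathrm{full}}
\\[4pt]
\frac{T_{\mathrm{tr}}}{T_{\mathrm{half}}} = \frac{0.15}{0.575} \approx 0.26
```

The result agrees with the measured 60 percent CPU time ratio and with the roughly 25 percent share of the transpose in half precision runs.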


Detailed timing from Run 8 shows that 51 percent of the CPU time is spent in the 
FFT subroutine, which contains 78 percent of the floating point operations. In 
other words, the FFT operates at 157.6 Mflops. The remaining 22 percent of the 
floating point operations are executed at 95 Mflops due mainly to shorter vector 
lengths and IF statements. 

7. PROBLEMS ENCOUNTERED 


Runs 7 and 8 of Table 1 require 3.5M words storage, and hence, do not fit within 
the 2M core memory at CSU with full precision. Thus, we must use 32-bit com- 
putation for efficient use of the CSU Cyber 205. Half-precision computation is 
sufficiently accurate for this code, and twice the operating speed is achieved. 
TABLE 1. PERFORMANCE OF CYBER 205 AT CSU 
(2 PIPES WITH 2M 64-BIT WORD) 


Since there is no compiler support available yet for half precision gather/scatter calls, 
we have to use special Q8 calls (machine instructions) to get the half precision 
code to compile properly on the CSU Cyber 205; the special Q8 instructions 
execute at full precision speed. Mr. Herbert Rothmund of CDC Sunnyvale was most 
helpful to us in providing these utilities. 

It is apparent that the I/O rate is not balanced with the CPU time. The reason is 
that the CSU Cyber 205 has only two channels to transfer data between fast 
memory and disk and they are inherently slow. Solid-state backing memory (or 
equivalent) would speed up the data transfer rate. For our problem, faster I/O 
would allow us to go to 128^3 mesh size. 

Since December 1982, three different compilers have been used: cycles 201109, 
L575, and 575B. Cycle 201109 did not have the half precision feature. Cycle 
L575 had half precision but lacked some automatic vectorization features. Cycle 
575B, the most recent version, does not have gather/scatter in half precision. 
Further improvements are needed if users are to get optimum performance from 
this machine. 


8. SIMULATION OF ISOTROPIC HOMOGENEOUS TURBULENCE 


A typical simulation of homogeneous isotropic turbulent flow is presented in this 
section. Figure 4 shows the time history of the three-dimensional energy spec- 
trum from initial time step to 300 time steps. Figure 5 shows the 3-D spectra of 
the components of the turbulent kinetic energy at time step 300. The flow is 
slightly anisotropic at low wavenumbers. This is due to the small number of 
modes at low wavenumbers. 


All of these results are in excellent agreement with both experiments and pre- 
vious simulations. Thus, we are confident that the code is performing satis- 
factorily and we will proceed to the simulation of compressed flows. The code 
presently runs at 1.9 seconds per time step for a 64^3 mesh on the 2-pipe Cyber 205; 
this compares with 5 seconds for the same type of code on the CRAY-1S in 
VECTORAL language. 


9. CONCLUSION 


In summary, we have written, debugged, and tested a code for solving the Navier- 
Stokes equations and for computing various turbulence statistical quantities. Most 
of the operations are readily vectorized, and 100 Mflops has been obtained for 64^3 
mesh size in-core runs on a 2-pipe Cyber 205. The major problems encountered so 
far are concerned with the lack of compiler utilities, such as half-precision com- 
piling capability for transpose operations. 

The program works well and has been validated for homogeneous isotropic tur- 
bulence. The code will next be used to help develop turbulence models for com- 
pressed flow in engines. 
NOMENCLATURE 

B_ij     Coordinate transformation tensor 

MY       Pencil size in Y-direction 

MZ       Pencil size in Z-direction 

NX       Number of mesh points in X-direction 

NY       Number of mesh points in Y-direction 

NZ       Number of mesh points in Z-direction 

p        Pressure fluctuations 

t        Time 

U_i,j    Mean strain rate tensor 

u_i      Velocity fluctuations in i-direction 

x        Space coordinate 

y        Space coordinate 

z        Space coordinate 

δ_ij     Kronecker delta 

ν        Kinematic viscosity 
Abstract 

Many important algorithms for solving problems in linear algebra require 
the repeated computation of the matrix-vector product b = Ax where A is 
symmetric and sparse. Examples are the conjugate gradient and Lanczos 
methods. 

This work has been directed toward the development of an efficient 
algorithm for performing this computation on the CYBER-203. The desire to 
provide software which gives the user the choice between the often conflicting 
goals of minimizing central processing (CPU) time or storage requirements has 
led to a diagonal-based algorithm in which one of three types of storage is 
selected for each diagonal. For each storage type, an initialization sub- 
routine estimates the CPU and storage requirements based upon results from 
previously performed numerical experimentation. These requirements are 
adjusted by weights provided by the user which reflect the relative importance 
the user places on the two resources. 

The three storage types employed were chosen to be efficient on the 
CYBER-203 for diagonals which are sparse, moderately sparse, or dense; 
however, for many densities, no diagonal type is most efficient with respect 
to both resource requirements. The user-supplied weights dictate the choice. 

Introduction 

Many of the important numerical techniques used today to solve linear 
equations require repeated computation of a symmetric matrix times a vector. 
Examples are the conjugate gradient method, with all its variants, for solving 
simultaneous linear equations (refs. 1 and 2) and the Lanczos algorithm for 
eigenvalue and eigenvector extraction (ref. 3). These methods are 
particularly attractive when the matrix is sparse since, unlike direct 
methods, they do not require storage of the entire matrix. The matrix is only 
used to multiply a vector and to do this one only needs to know the nonzero 
elements and their position within the matrix. 

The primary objective of this work has been to develop software for the 
CYBER-203 that provides an efficient means for computing b = Ax when A is 
an n × n, symmetric, sparse matrix. 

Because use of vector hardware instructions on a vector processor has 
very definite implications about the storage, a user's desire to minimize both 
the required central processing unit (CPU) time and the total storage needed 
to represent A are often conflicting goals. Thus, a more specific objective 
of the work has been to design the software so that it provides alternative 
storage/computational procedures for the matrix A and automatically selects 
the procedure which best reflects the user's relative concerns about minimizing 
the two resources. 

These objectives have led to the development of a diagonal-based storage 
and computation scheme in which a preprocessing subroutine, CMPACT, chooses 
one of three storage methods for each diagonal using CPU and storage estimates 
and user-provided resource weighting information. The subroutine, CMXV, can 
be called repeatedly to compute Ax using the compact form of matrix A. 

Subsequent sections of the paper will describe the relevant CYBER-203 
instructions used, the diagonal-based algorithm with the tradeoffs between the 
methods, a description of the implementation used, and results for several 
sparse matrices. 
CYBER-203 Characteristics 


The CYBER-203 at Langley Research Center is a vector processing computer 
capable of producing 50 million floating point results (64 bit) per second for a vector 
addition and 25 million for a vector multiplication. It has one million words 
of bit addressable central memory in a virtual memory architecture. 

The high CPU rates are achieved by operations on long vectors whose 
components, by definition, are consecutively stored in memory. However, if 
vector lengths are short (say, 50 or less), the fast scalar capability makes 
serial computation superior. 

In addition to the usual arithmetic operations, several nontypical 
hardware instructions exist which proved useful in this work. 
These were the vector compare, compress, expand, and bit count. Figure 1 
demonstrates their use. 
(a) Compare vector not equal to 0; result to bit vector, B; count "on" bits 
in B. 

(b) Compress vector by bit vector. 

(c) Expand compressed vector by bit vector. 

Figure 1. CYBER-203 nontypical vector instructions. 
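What the four operations do to their operands can be mimicked with ordinary Python lists. The helper names below are hypothetical and only illustrate the semantics described in the figure; they are not the CYBER-203 instruction mnemonics or calling sequences.

```python
def compare_nonzero(vector):
    """Bit vector with a 1 wherever the input element is nonzero."""
    return [1 if value != 0 else 0 for value in vector]


def bit_count(bits):
    """Number of 'on' bits."""
    return sum(bits)


def compress(vector, bits):
    """Keep only the elements whose control bit is on."""
    return [value for value, bit in zip(vector, bits) if bit]


def expand(compressed, bits):
    """Spread compressed elements back out, filling the gaps with zeros."""
    remaining = iter(compressed)
    return [next(remaining) if bit else 0.0 for bit in bits]


diagonal = [0.0, 3.0, 0.0, 0.0, 7.5, 1.0]
bits = compare_nonzero(diagonal)
nonzeros = compress(diagonal, bits)
restored = expand(nonzeros, bits)
```

Compress followed by expand with the same bit vector returns the original vector, which is the property that lets a sparse diagonal be stored as its nonzeros plus one bit per position.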
Diagonal-Based Matrix Multiplication 

It is possible to describe the multiplication process b = Ax for a 
matrix A in terms of elements of each diagonal. Let A(ℓ) denote the 
ℓth superdiagonal (also the ℓth subdiagonal since A is symmetric) and let 
A_k(ℓ) be the kth component. That is, A_k(ℓ) = a_{k,k+ℓ} = a_{k+ℓ,k}. The procedure 
for computing b = Ax for the n × n matrix A is 

    b_k ← A_k(0) x_k,    k = 1, 2, ..., n. 

    For ℓ = 1, 2, ..., n-1: 

        b_k ← b_k + A_k(ℓ) x_{k+ℓ}        for k = 1, 2, ..., n-ℓ      (1) 

        b_{k+ℓ} ← b_{k+ℓ} + A_k(ℓ) x_k    for k = 1, 2, ..., n-ℓ      (2) 

    End For 


Note that if A is banded, ℓ need only go from 1 to the bandwidth 
and that if any diagonals are identically zero, they can be easily identified 
and all computation for them in (1) and (2) can be omitted. 
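A direct transcription of the procedure into Python is given below as a sketch. The dictionary layout for the diagonals, the zero-based indexing and the function name are assumptions made for the example; on the CYBER-203 the inner loop would be carried out by vector operations rather than element by element.

```python
def symmetric_matvec_by_diagonals(diagonals, x):
    """Compute b = A x for symmetric A stored by superdiagonals.

    diagonals maps the offset l (0 for the main diagonal) to a list of
    length n - l holding A_k(l) = a[k][k + l]. Offsets that are absent
    are treated as identically zero and skipped.
    """
    n = len(x)
    main = diagonals.get(0, [0.0] * n)
    b = [main[k] * x[k] for k in range(n)]
    for offset, diag in diagonals.items():
        if offset == 0:
            continue
        for k in range(n - offset):
            b[k] += diag[k] * x[k + offset]       # equation (1): superdiagonal
            b[k + offset] += diag[k] * x[k]       # equation (2): subdiagonal
    return b


# Hypothetical 3 x 3 example: A = [[2, 1, 0], [1, 3, 4], [0, 4, 5]].
diagonals = {0: [2.0, 3.0, 5.0], 1: [1.0, 4.0]}
b = symmetric_matvec_by_diagonals(diagonals, [1.0, 2.0, 3.0])
```

Each stored diagonal is used twice, once for the superdiagonal and once for the subdiagonal, so only one copy needs to be kept.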

The diagonal-based scheme has been selected as the foundation for this 
work for several reasons: 

a. Nonzero structure of real problems - Many matrices arising from finite 
difference or finite element formulations naturally lead to a sparsity 
pattern in which most of the nonzeros lie along a few of the diagonals. 

The 5 diagonal matrix arising from central differencing of Poisson's 
equation is an extreme example. Of course, there the pattern is so pre- 
dictable that special storage techniques are not needed; but for irregular 
grids, or more complex equations with more complicated differencing, the 
sparsity is not so easily specified. This is especially true in finite 
element formulations where one of the strengths of the method is the 
ability to use nonuniform elements. 
b. Vectorization - The n - ℓ multiplications and additions in equations (1) 
and (2) can be carried out by vector operations of length n - ℓ. 

c. Symmetry of diagonals - The ℓth subdiagonal is also the ℓth 
superdiagonal. Since equations (1) and (2) are identical in form, the storage 
and computation most appropriate for the subdiagonal is also most 
appropriate for the superdiagonal. 

Storage Tradeoffs 

The vector computations implied in equations (1) and (2) assume A(ℓ) is 
available as a vector of length n - ℓ. However, if the diagonal is 
relatively sparse, one might not want to store the entire diagonal with all its 
zeros. In fact, if the diagonal is very sparse, neither vector storage nor 
vector computation is likely to be very efficient. 

Described below are three types of diagonal storage and their associated 
computation to execute equations (1) and (2). 

Full Vector (Type 1) - Here the entire diagonal is stored including any 
zeros. Vectors of length n - ℓ are used. This mode will be most 
efficient when A(ℓ) is very dense. 

Compressed Vector Plus Bit Pattern (Type 2) - Here only the nonzeros are 
stored along with a bit vector to give positional information within the 
diagonal. The computation is identical to that with type 1 diagonals 
after an expand is performed to generate the full diagonal A(ℓ). The 
extra expand makes type 2 CPU requirements always exceed type 1, but the 
storage can be considerably less. 

Compressed Vector Plus Row Pointers (Type 3) - Here the assumption is that 
A(ℓ) is so sparse that it will be inefficient to expand the compressed 
vector. Equations (1) and (2) are executed serially making use of the row 
indices stored for positional information. 
Figures 2 and 3 show the CPU and storage requirements for a diagonal 
of length 1000 as a function of density. A comparison of the two figures 
shows that, unfortunately, one cannot identify intervals of density where a 
particular diagonal type is most efficient with respect to both resources. 
For instance, type 3 CPU is least for d < 0.11 but has a greater storage 
requirement than type 2 for d > 0.02. Even in those regions where one 
diagonal type is most efficient for both resources (type 1 for very dense and 
type 3 for very sparse), the boundaries of these regions vary with the length 
of the diagonal. 

Since the minimization of both resources is frequently not possible, and
since different users may attach different importance to the two resources,
it was decided to let the user influence the storage selection through
resource weighting factors. To implement this, the initialization subroutine,
CMPACT, does the following for each diagonal:

(1) Estimates the CPU and storage requirements for each of the three candidate
types.

(2) Applies a user-supplied weight to compute the weighted resource
requirement for each method.

(3) Selects the storage type that minimizes the sum of the two weighted
resource requirements.

That is, denoting the predicted storage and CPU requirements for the j-th
diagonal type by s_j and c_j respectively, their minima by s_m and c_m, and
the user-specified weights by s_w and c_w, the normalized and weighted
resource, r_j, for the j-th diagonal type is computed as

$$ r_j = s_w \frac{s_j}{s_m} + c_w \frac{c_j}{c_m}, \qquad j = 1, 2, 3 $$

Subroutine CMPACT computes r_j and selects the diagonal type which yields the
minimum value of r_j.
FIGURE 2. CPU TIME FOR DIAGONAL WITH LENGTH 1,000.



FIGURE 3. STORAGE REQUIREMENTS FOR DIAGONAL WITH LENGTH 1,000. 
For this approach, CMPACT must be able to estimate s_j and c_j for
all n and d. The storage estimates are easily made in terms of a diagonal
of length n having z nonzeros:

     s_1 = n
     s_2 = z + w
     s_3 = 2z

where w is the least number of 64-bit words needed to hold n bits.

The CPU estimates were obtained by timing the computation for a range
of n and density d. For type 1 and 3 diagonals, single formulas were
obtained, but the complexity of the expand used in type 2 diagonal computation
required a table of values. The time in microseconds to perform the
computations implied in equations (1) and (2) for a single diagonal can be
estimated by

     c_1 = 29 + 0.122 n
     c_2 = see Table I
     c_3 = 7 + 1.74 z

Since these values are used only in a selection process, their accuracy
to a percent or two is sufficient.


Table I.- Type 2 diagonal CPU times (microseconds) as a function
of diagonal length n and density d.

                              d
      n      0.    .1    .2    .4    .6    .8   1.0
    100      53    53    53    57    60    63    68
    500     123   123   124   141   160   176   197
   5000     901   901   918   997  1134  1280  1429
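
The selection procedure can be sketched as follows, using the storage and CPU estimates given above. Several details are assumptions rather than statements from the paper: the weighted resource is taken to be s_w s_j/s_m + c_w c_j/c_m, Table I is interpolated linearly in both n and d (the paper does not say how the table is consulted), and a guard against a zero minimum is added for an empty diagonal. The function names are hypothetical.

```python
import math

# Table I: type 2 CPU time in microseconds by diagonal length n and density d.
TABLE_D = [0.0, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
TABLE_N = [100, 500, 5000]
TABLE_C2 = [
    [53, 53, 53, 57, 60, 63, 68],
    [123, 123, 124, 141, 160, 176, 197],
    [901, 901, 918, 997, 1134, 1280, 1429],
]

def _interp(xs, ys, x):
    """Piecewise-linear interpolation, clamped at both ends (an assumption)."""
    if x <= xs[0]:
        return ys[0]
    for k in range(1, len(xs)):
        if x <= xs[k]:
            t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
            return ys[k - 1] + t * (ys[k] - ys[k - 1])
    return ys[-1]

def estimates(n, z):
    """Storage (words) and CPU (microseconds) estimates for types 1, 2, 3."""
    d = z / n
    w = math.ceil(n / 64)                 # 64-bit words needed to hold n bits
    storage = [n, z + w, 2 * z]
    c2_by_n = [_interp(TABLE_D, row, d) for row in TABLE_C2]
    cpu = [29 + 0.122 * n, _interp(TABLE_N, c2_by_n, n), 7 + 1.74 * z]
    return storage, cpu

def select_type(n, z, s_w, c_w):
    """Return 1, 2 or 3: the type with the smallest weighted resource r_j."""
    storage, cpu = estimates(n, z)
    s_m = max(min(storage), 1)            # guard for z = 0 (not in the paper)
    c_m = min(cpu)
    r = [s_w * s / s_m + c_w * c / c_m for s, c in zip(storage, cpu)]
    return 1 + r.index(min(r))
```

For a diagonal of length 1000 with 10 nonzeros, both pure weightings select type 3; with 500 nonzeros, weighting storage alone selects type 2, while a fully dense diagonal weighted on CPU alone selects type 1.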
Implementation

The matrix is received in subroutine CMPACT in its expanded form as an
N by IB array. Each of the IB diagonals is treated individually as the
compact representation, array C, is formed. C is a linear array in which
the pertinent data for the l-th diagonal is stored behind that for the
(l - 1)-st diagonal. As illustrated in figure 4, this can be, for types 1, 2,
or 3 respectively, either the entire diagonal, the nonzero bit pattern for the
diagonal followed by the nonzeros, or the nonzeros and index data. A vector
compare with broadcast zero generates the bit pattern and provides the number
of nonzeros and density. If the weighting procedure determines that the
diagonal should be type 2 or 3, a compress is performed. In addition, two
integers for each diagonal are stored in a separate array. The first
identifies the diagonal type and the second the number of nonzeros in the
diagonal.

The subroutine returns to the user the CPU and storage estimates for the
user-provided weights. In addition, the estimates for the combinations
s_w = 1, c_w = 0 and s_w = 0, c_w = 1 are returned to help the user adjust
the weights in subsequent computations.

Figure 4 - Storage for A(l) (n - l = 6).


Results 

Results from two test matrices are presented here to demonstrate the
effect and control the user has on the matrix storage and computational
requirements by giving the statistics for different combinations of s_w and
c_w. Refer to Tables II and III.

Case 1 - This is a randomly generated matrix with 400 equations and a
bandwidth of 21. The densities are approximately uniformly distributed
between 0 and 1. The average density is 55.7%. The storage selection that
minimizes the CPU time (1.57 msec; mostly type 1) yields the largest storage
requirement. The selection to minimize storage (4713 words; mostly type 2)
yields the largest computation time.

Case 2 - This is a sparse matrix resulting from a finite element
formulation with triangular elements and 3 degrees of freedom at each node.
The matrix has 1086 equations, a bandwidth of 81, and an average density of
7.8%. Most of the diagonals are sparse. Of the 81 diagonals, 57 are less than
5% dense and approximately half of the nonzeros are on the four diagonals
closest to the main diagonal. Because of the relatively few dense diagonals,
most of the diagonals are type 2 (to minimize storage) or type 3 (to minimize
CPU).

Both examples demonstrate the conflicting goals of minimizing both
resources. They also show that use of the weighting factors can give the user
a rather wide range of resource distributions. For instance, in the second
example a weighting of 1 for c_w leads to a CPU time that is minimum but a
storage requirement which is 1.73 times that obtained with s_w = 1. However,
setting s_w = 1 yields a CPU time which is 2.6 times the minimum. A
reasonable middle ground occurs when s_w = c_w = 0.5. In this case, the CPU
time is 1.09 times the minimum and the storage is 1.2 times the minimum.
Table II.- Case 1; 21 x 400 random matrix

     Weights          Resources           Diagonal Selection
   c_w    s_w     CPU (secs)  Storage    Type 1  Type 2  Type 3
    0      1        .00271      4713        1      20       0
   .3     .7        .00217      4950        7      13       1
   .5     .5        .00193      5481       11       9       1
   .7     .3        .00174      6053       14       5       2
    1      0        .00157      7495       19       0       2


Table III.- Case 2; 81 x 1086 finite element matrix

     Weights          Resources           Diagonal Selection
   c_w    s_w     CPU (secs)  Storage    Type 1  Type 2  Type 3
    0      1        .01680      8032        1      72       8
   .3     .7        .00800      9200        3      17      61
   .5     .5        .00703      9622        3       8      70
   .7     .3        .00682      9820        3       4      74
    1      0        .00646     13883        8       0      73
Summary


This paper has described a computational and storage algorithm for sparse 
matrix multiplication on the CYBER-203. The multiplication is performed using 
diagonals of the matrix as the candidate vectors since this is where nonzero 
patterns predominate in many scientific applications. Three types of diagonal 
sparsity patterns are identified (roughly speaking, either dense, moderately 
sparse, or sparse) and storage and computational procedures developed for 
each. 

Since, for most densities, no single diagonal type minimizes both storage 
and CPU requirements, an initialization subroutine selects the most 
"efficient" type for the diagonal based on estimated resource requirements and 
user-provided weights which indicate the relative importance the user attaches 
to each resource. 

Examples are given which illustrate that, for a given matrix, the weights 
can be used to achieve minimal CPU time (at the expense of storage) or minimal 
storage (at the expense of CPU time) or some compromise between the two. 
The computational aspects of modeling material failure in structural wood
members are presented with particular reference to vector processing aspects.
Wood members are considered to be highly orthotropic, inhomogeneous, and
discontinuous due to the complex microstructure of wood material and the
presence of natural growth characteristics such as knots, cracks and cross
grain in wood members. The simulation of strength behavior of wood members is
accomplished through the use of a special purpose finite element/fracture
mechanics routine, program STARW (STrength Analysis Routine for Wood).
Program STARW employs quadratic finite elements combined with singular crack
tip elements in a finite element mesh which accounts for the complexities
inherent in wood structural members. The need to use a highly refined finite
element mesh to adequately model material behavior results in the formulation
of thousands of simultaneous equations which must be generated and solved
repeatedly to model the nonlinear failure process which occurs. The
availability of the CYBER 205 at Colorado State University has made
implementation of program STARW at the level described not only possible, but
also relatively economical. Vector processing techniques are employed in mesh
generation, stiffness matrix formation, simultaneous equation solution, and
material failure calculations. The paper addresses these techniques along
with the time and effort requirements needed to convert existing finite
element code to a vectorized version. Comparisons in execution time between
vectorized and nonvectorized routines are provided.
INTRODUCTION

Accurate knowledge of the strength of a structural member is essential
information to the design engineer concerned with structural safety and
efficient material use. A means to predict material strength is necessary,
since all materials exhibit some variability in strength and it is not
feasible to physically test every structural member to determine its load
carrying capacity. The sophistication of strength prediction models has
generally advanced, not only with the discovery and refinement of new
computational methods, but also with the increase in computer capabilities
which enable efficient application of the new methods.

In the case of wood structural members, the current strength prediction
method is a highly approximate procedure based on empirical concepts from the
1930's. This results in a strength prediction that is relatively uncertain.
The current strength prediction procedure is based on the results of physical
tests because until now it has not been possible to mathematically model wood
member failure and rationally predict strength. The most obvious
difficulties, namely orthotropic material properties, the presence of knots
and associated grain deviations, and the presence of cracks from seasoning
and partial material failure, can now be successfully modeled with program
STARW (STrength Analysis Routine for Wood) (2).

The nature of the nonlinear failure modeling process presents a
computational problem of such a large magnitude that it cannot be efficiently
accomplished on computers that do not have the capacity of a CYBER 205.
Program STARW represents a case where modest effort in invoking vector
processing syntax has not only made implementation of the program possible,
but has also resulted in a relatively economical solution.
AN OVERVIEW ON MATHEMATICALLY MODELING WOOD MEMBER FAILURE WITH STARW

Program STARW uses two-dimensional orthotropic finite elements to model
behavior in the longitudinal-transverse plane of a loaded wood member.
Tensile load is applied in the longitudinal direction as shown in Fig. 1.

Figure 1. Loaded Wood Structural Member
(Longitudinal-Transverse Plane)

A knot in a structural specimen of wood creates localized grain deviation
as indicated in Fig. 1. This grain deviation has an extremely important
effect on stress distributions at locations near the knot (3). An iterative
procedure to locate mesh coordinates corresponding to the grain deviation
around a knot is employed in program STARW. This procedure relates distortion
of wood grain around a knot to streamlines of laminar fluid flow around an
elliptical object and has therefore been named the "flow-grain analogy" (4).
Utilizing the flow-grain analogy, a representative finite element mesh is
automatically constructed of eight node quadrilateral elements, six node
triangular elements, and eight node singular elements. Since tangential
elastic stiffness of wood may be as little as 1/20 of the longitudinal elastic
stiffness, all three types of finite elements are required to model different
elastic material behavior in the longitudinal and tangential directions.
Appropriate elastic stiffness values for each element are automatically
assigned.

Singular elements are used to model material behavior around the tip of
cracks that form as the load on the member is increased. These elements were
developed using theory from linear elastic fracture mechanics (1).
Experimental investigations have indicated that cracks in structural lumber
will usually form and propagate along a grain line. Thus, cracks are modeled
by program STARW by "unzipping" the finite element mesh along the material
separation and placing the singular elements around the crack tip. A
resulting finite element mesh is shown in Fig. 2. The "unzipping" process and
placement of the singular elements are performed automatically upon cue by
the user when the appropriate failure conditions are indicated in the program
output.
Figure 2. Example Finite Element Mesh Including Crack
(figure label: Direction of Applied Stress)

The output directly calculated from each analysis is as follows:

1) Horizontal and vertical displacement at each node in the mesh.

2) Stresses for each element: parallel-to-grain, perpendicular-to-grain,
and shear.

3) Stress intensity factors resulting from the use of singular elements.

4) A failure summary that indicates to the user what action should be
taken to model the next step in the failure process.







The stress intensity factors directly reflect the strength of the stress field
around the crack tip. The stress intensity factors are compared within the
program to a fracture criterion for structural wood members to determine if
the existing crack propagates at a given applied load. The element stresses
are compared to a failure criterion for structural wood members to determine
if a crack will form near the element under consideration. The results of
these comparisons are expressed in the failure summary.

Analyses are performed repeatedly with stress and stress intensity factors
monitored at each step and compared within the program logic to the
fracture/failure criteria. As the load on the member is increased, more
cracking and material failure occur. The user, based on the information in
the failure summary and the overall stress picture, gives the program the
necessary information to model the successive step in the failure process. In
the future, as research progresses, program logic will be expanded to include
the decision making process the user currently makes based on the failure
summary. Failure may be continually modeled in this fashion until the member
under consideration has failed to the point where it cannot resist an increase
in load. At this point, the predicted strength is realized. In studying the
behavior of a wood member, 30 analyses may typically be performed before the
member reaches its capacity. A simplified diagram of the failure model is
contained in Fig. 3.
Figure 3. Strength Prediction Model


ASPECTS AND IMPLICATIONS OF VECTORIZATION

For each analysis, program STARW performs five general sets of
computations:

1) Generation of a suitable finite element mesh using the flow-grain
analogy and an unzipping process to include cracks.

2) Formation of a set of simultaneous equations which may be 2000 to 5000
equations in length.

3) Solution of the simultaneous equations using Gauss elimination.

4) Calculation and coordinate transformation of element stresses based on
the solution vector and the element grain angles.

5) Computations with the failure/fracture criteria using element stresses
and stress intensity factors as input.

Routines included in Items 1 through 4 existed in limited form and were
executed for small problems on a CYBER 720 prior to application on the CYBER
205. Failure calculations in Item 5 and additional mesh generation
capabilities were added and designed specifically for use on the CYBER 205.
After compiler-induced vectorization proved to be inadequate in significantly
reducing execution time, it became apparent that it was essential to
explicitly vectorize selected portions of the existing routines. At the same
time, it was not the primary goal of the project to expend unlimited effort to
achieve the maximum in vector processing; rather, the goal was to produce a
powerful research tool that could be economically implemented. The bulk of
the conversion (and execution time savings) was achieved with modest effort
after becoming familiar with vector processing syntax.

To date, a means to vectorize the iterative solution of the fluid
mechanics equations contained in the flow-grain analogy has not been
established. This is not of great concern since, as in many finite element
routines, mesh generation does not account for a significant portion of the
total execution time. However, the unzipping of the finite element mesh to
model cracks involves, in part, a uniform renumbering of nodal points. This
renumbering is easily accomplished with basic vector commands since nodal
coordinates are stored in vector form.
Formation of the set of simultaneous equations can typically take from 5
to 50 percent of total execution time in an unvectorized finite element
analysis. In program STARW, a 16 by 16 element stiffness matrix must be
constructed for each element and properly combined with other element
stiffness matrices to form the coefficient matrix (global stiffness matrix) of
the simultaneous equations. Formation of the 16 by 16 matrix involves dot
products of vectors of length 16. Some time savings is attained here through
the use of the CYBER Q8SDOT command even though the vector length is rather
small.

Solution of the simultaneous equations typically requires 40 to 90
percent of the total execution time of a finite element analysis. The 90
percent figure is not uncommon for large two-dimensional analyses. Therefore,
large time savings can be attained by vectorizing the solution algorithm
alone. In program STARW, Gauss elimination is used to decompose the global
stiffness matrix, followed by a back substitution to obtain the solution. For
the problem under consideration the stiffness matrix is banded and symmetric,
and therefore, only the upper diagonal half of the matrix is stored.
Furthermore, if the global stiffness matrix is stored in columns rather than
rows, then adjacent terms in a row of the global stiffness matrix will be
stored contiguously. Since Gauss elimination involves operations of one row
upon another, by storing the matrix as described, each row will be a vector.
"Gather" and "scatter" vector formation commands are unnecessary. Gauss
elimination involves operations on the matrix rows in a number of nested DO
loops. Vectorization of even the innermost loop results in large time
savings. Back substitution involves repeated dot products of previously
formed vectors. This can again be easily accomplished with the CYBER Q8SDOT
command. An unvectorized and otherwise identical vectorized portion of the
back substitution is shown in Fig. 4 to illustrate typical vectorization.
      DO 460 J = 2, JEND
         J1 = I1 + J - 1
         B(I1) = B(I1) - A(J,I1) * B(J1)
  460 CONTINUE


      LE = JEND - 1
      J1 = I1 + 1
      B(I1) = B(I1) - Q8SDOT (A(2,I1;LE), B(J1;LE))

Figure 4. Example DO Loop and Corresponding Vector Syntax
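
The equivalence of the two forms in Fig. 4 can be checked with a small sketch. It is written in Python rather than CYBER FORTRAN, uses 0-based indices, and represents column I1 of A as a plain list `a_col` with `a_col[J-1]` standing for A(J,I1); the function names are hypothetical.

```python
def step_loop(a_col, b, i, jend):
    """Scalar form: the DO 460 loop, with k = J - 1 running from 1 to jend - 1."""
    for k in range(1, jend):
        b[i] -= a_col[k] * b[i + k]

def step_dot(a_col, b, i, jend):
    """Vector form: one dot product of two contiguous slices of length LE."""
    le = jend - 1
    dot = sum(p * q for p, q in zip(a_col[1:1 + le], b[i + 1:i + 1 + le]))
    b[i] -= dot
```

Both routines leave the same value in `b[i]` up to floating point rounding, since the dot product form sums the products before subtracting them.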

With the solution of the equations established, element strains and
stresses can be calculated in global coordinates. Since this calculation is
essentially the same for every element, and care is taken to store the
necessary quantities in vector form, basic vector operations accomplish this
task. The solution vector is found in the global coordinate system and thus
the calculated stresses are also expressed in this system. It is desirable,
however, to know the stresses in the coordinate system of each element, that
is, in the perpendicular-to-grain and parallel-to-grain directions. The
element stresses must be transformed according to the element grain angle.
Since the element grain angles are stored contiguously and in order, this
computation can be accomplished with basic vector commands.
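
A minimal sketch of such a transformation is given below. It assumes the standard plane stress rotation formulas, which the paper does not write out, and applies them element by element to parallel lists, which is the pattern a vector machine would execute as a few long vector operations. The function name and argument names are hypothetical.

```python
import math

def to_grain_axes(sx, sy, txy, theta):
    """Rotate global plane stresses by each element's grain angle (radians).

    sx, sy, txy and theta are parallel lists with one entry per element.
    Returns parallel-to-grain, perpendicular-to-grain and shear stresses.
    """
    par, perp, shear = [], [], []
    for x, y, t, a in zip(sx, sy, txy, theta):
        c, s = math.cos(a), math.sin(a)
        par.append(x * c * c + y * s * s + 2.0 * t * s * c)
        perp.append(x * s * s + y * c * c - 2.0 * t * s * c)
        shear.append((y - x) * s * c + t * (c * c - s * s))
    return par, perp, shear
```

With a grain angle of zero the stresses are unchanged, and for any angle the sum of the two normal stresses is preserved, which gives a quick consistency check.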

To complete an analysis, the stresses and stress intensity factors for
cracks must be inserted into the failure/fracture criteria. The
failure/fracture criteria interface the mathematical results from an analysis
to the real life failure actions. Required information includes the maximum
stresses and their locations within the flow-grain mesh. Since stresses are
stored in element order in vectors, this information can be obtained more
quickly and easily by using CYBER Q8 commands than with scalar search
algorithms.

To put the vectorization discussed into perspective, a typical problem
was analyzed using unvectorized and vectorized routines. Since unvectorized
versions of the mesh generator (Item #1) and the maximum stress searching
routine (Item #5) do not exist, vectorized routines had to be used for both
sides of the example. The example problem consisted of 4180 degrees of freedom
(equations) and for simplification no cracks were included. The corresponding
CPU execution times for different phases of the analysis are shown in Table 1.


Table 1. Efficiency of Execution Time for Vectorized Routines

                                Unvectorized    Vectorized     Efficiency
                                TIME IN SEC.    TIME IN SEC.   UNVECT/VECT
  Mesh Generation                    1.90           1.90           1.00
  Stiffness Matrix Formation         4.84           2.80           1.73
  Solution of Equations             97.87           4.91          19.90
  Miscellaneous Computations         5.05           4.60           1.10
  Total                            109.66          14.21           7.70

As clearly shown for this problem, the vectorized equation solver was 20
times faster than its otherwise identical unvectorized version. This savings,
along with other vectorization, reduced analysis time by nearly a factor of
eight. One will note that while the miscellaneous computations were somewhat
insignificant in the unvectorized analysis, they take on new importance in
the vectorized analysis. Additional effort may be well spent in further
vectorization of the miscellaneous computations.
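
The shift in relative importance can be made concrete with a few lines of arithmetic on the Table 1 timings. The sketch below is in Python and uses only the figures printed in the table; the variable names are hypothetical.

```python
# Times in seconds from Table 1: (unvectorized, vectorized).
times = {
    "mesh generation": (1.90, 1.90),
    "stiffness matrix formation": (4.84, 2.80),
    "solution of equations": (97.87, 4.91),
    "miscellaneous computations": (5.05, 4.60),
}

total_unvect = sum(u for u, _ in times.values())
total_vect = sum(v for _, v in times.values())

# Overall ratio of unvectorized to vectorized execution time.
speedup = total_unvect / total_vect

# Share of total time spent in each phase, before and after vectorization.
share_unvect = {name: u / total_unvect for name, (u, _) in times.items()}
share_vect = {name: v / total_vect for name, (_, v) in times.items()}
```

The equation solver accounts for about 89 percent of the unvectorized run, while the miscellaneous computations grow from under 5 percent of the unvectorized total to roughly a third of the vectorized total.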

CONCLUSIONS

Failure in wood members is being successfully modeled and analytically
investigated in greater detail than before possible through implementation of
program STARW on the CYBER 205 (2). An understanding of material failure is
essential to accurately predict member strength and to safely and efficiently
use the material in engineering application.

Vectorization of program STARW has turned an unwieldy and expensive
nonlinear failure modeling method into an efficient research tool.
Vectorization of existing routines need not be a lengthy and laborious effort
to achieve execution time savings. It has been shown that careful organization
of operands into vectors and modest effort in invoking vector syntax can cut
program execution time by a factor of nearly 8 for a typical problem in this
research. The largest savings is realized in the solution of the simultaneous
equations.

While use of program STARW is expected to provide new information on
fracture and failure in wood members, the availability of machines with the
capabilities of the CYBER 205 in general holds promise for advances in the
analytical modeling of all materials. These advances in research will
initiate new applications of materials and more efficient and reliable use of
materials in existing applications.
ABSTRACT


Very efficient algorithms for solving large sparse systems of 
simultaneous linear equations have been developed for serial 
processing computers. These involve a reordering of matrix 
rows and columns in order to obtain a near triangular pattern 
of non-zero elements. Then an LU factorization is developed to 
represent the matrix inverse in terms of a sequence of 
elementary gaussian eliminations, or pivots. 

In this paper we show how to adapt these algorithms for 
efficient implementation on vector processors. Results 
obtained on the CYBER 200 Model 205 are presented for a series 
of large test problems which show the comparative advantages of 
the triangularization and vector processing algorithms. 
Preliminary Results in Implementing a Model of the
World Economy on the Cyber 205: A Case of Large
Sparse Nonsymmetric Linear Equations


Abstract 


Daniel B. Szyld 
A brief description of the Model of the World Economy
implemented at the Institute for Economic Analysis is 
presented, together with our experience in converting the 
software to vector code. 

For each time period, the model is reduced to a linear 
system of over 2000 variables. The matrix of coefficients 
has a bordered block diagonal structure, and we show how some 
of the matrix operations can be carried out on all diagonal 
blocks at once. 

We present some other details of the algorithms and 
report running times. 
1. Description of the Model


The first input-output model of the world economy was 
originally developed for the United Nations by Leontief, Carter 
and Petri [1977] as a tool for evaluating alternative long-term 
economic policies. The most recent version that has been 
implemented spans the period 1970-2030 in 10-year intervals. 

The model is dynamic in the sense that the solution for each 
10-year period requires information obtained from the solution 
for the previous period. In this paper we focus on the solution 
of a single time period. 

In the current version of the model, the world is divided 
into 16 regions (r=16) and for each of the regions the detailed 
economic activities are described by a set of linear algebraic 
equations of the form 

$$ A_i z_i + S_i w = 0 \qquad (i = 1, \ldots, r). \tag{1} $$

The components of the vectors z_i correspond to levels of
domestic production, imports, and exports of goods and
services, and so on, for each region, and w is the vector of
total world exports. In addition there are global constraints
described by the equation

$$ \sum_{i=1}^{r} G_i z_i = 0 \tag{2} $$

which imposes the consistency among regional trade relations.

A more detailed description of the model can be found in 
Leontief, Carter and Petri [1977], Duchin and Szyld [1979], and 
Szyld [1981]. 
All the matrices involved are very sparse. For example

     A_i could be 200 x 250 with 2500 nonzeros.
     S_i could be 200 x 50 with 50 nonzeros.
     G_i could be 50 x 250 with 100 nonzeros.

Each matrix A_i has more columns than rows and therefore some
components of z_i have to be prescribed.

If x_i are the vectors of unknown components of z_i, and M_i
and E_i are the corresponding submatrices of A_i and G_i, the whole
model for a single time period can be regarded as a linear
system of equations of over 3000 variables with a nonsymmetric
bordered block diagonal matrix of coefficients of the form:

$$
\begin{pmatrix}
M_1 &     &        &     & S_1 \\
    & M_2 &        &     & S_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & M_r & S_r \\
E_1 & E_2 & \cdots & E_r & 0
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_r \\ w \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_r \\ 0 \end{pmatrix}
\tag{3}
$$

where the blank blocks in the matrix are zero blocks.

When the model was first implemented, the program for
the solution of (3) inverted the matrices M_i and stored the
inverses. The approximate computer time to perform this task
was 4 hours on a PDP-11. The (dense) inverses were saved for
subsequent runs during which they were updated depending on
the components of z_i prescribed and on changes in the matrices
A_i. Each of these subsequent runs required 110 seconds on an
IBM 370 for each time period.

The set of prescribed components of z_i and the matrices A_i
are used to determine a scenario, i.e., a set of economic
assumptions. Studies carried out with the World Model compare
results of different scenarios, i.e., the implications of the
different assumptions. The consequences of the introduction
of new technologies, different development strategies, or
shifts in trade patterns are among the numerous scenarios that
can be analyzed. Thus, the World Model is a flexible tool to
analyze alternative policies. Several large scale empirical
studies have been carried out with this model. The most recent
ones are reported in Leontief and Duchin [1983], Leontief and
Sohn [1982], Leontief, Koo, Nasar and Sohn [1983] and Leontief,
Mariscal and Sohn [1982].

To make this tool much more flexible we needed to greatly
reduce the computational resources required to run a scenario.
A first step in that direction was the application of sparse
matrix techniques for the solution of (3). In the present
implementation the matrices A_i are stored using a sparse
scheme, i.e., only the nonzero elements are stored, together
with some integer arrays indicating their locations. A single
array of approximate length 3200 contains all vectors x_i, i = 1,...,r.
Other such arrays contain the vectors b_i, the nonzero values
of the matrices S_i and E_i, or other data objects. Similarly,
objects like the nonzeros of the matrices M_i appear in single
arrays of length close to 5000.

2. Method of Solution

The algorithmic details of the solution of (3) are given in
Duchin and Szyld [1979], Szyld [1981], and Furlong and Szyld
[1982]. Here we enumerate the operations for the solution
of (3) very schematically.
loop 1.  For i = 1,...,r
         1.1.  Read A_i, G_i, S_i, and the prescribed elements of z_i
         1.2.  Produce M_i, E_i and b_i
         1.3.  Obtain factorization of M_i

loop 2.  For i = 1,...,r
         2.1.  Prepare different right hand sides with columns of S_i
         2.2.  Solve systems with matrix M_i

loop 3.  Obtain w

loop 4.  For i = 1,...,r
         4.1.  Compute b_i - S_i w
         4.2.  Solve M_i x_i = b_i - S_i w

The factorization of the matrices M_i (in step 1.3) and the
solution of several linear systems with them (in steps 2.2 and
4.2) are performed with routines from the MA28 set developed
by Duff [1977].
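
The four loops amount to a block elimination of the bordered system. The following sketch illustrates that idea in Python on small dense blocks; it is not the MA28-based implementation. It assumes the system has the form M_i x_i + S_i w = b_i for each region together with a border equation sum_i E_i x_i = g, where g is a hypothetical name for the border right-hand side. For brevity each solve refactors M_i, whereas the method described above factors each M_i once and reuses the factors. All function names are hypothetical.

```python
def solve(a, rhs):
    """Dense Gaussian elimination with partial pivoting (stand-in for MA28)."""
    n = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def matvec(a, v):
    return [sum(p * q for p, q in zip(row, v)) for row in a]

def column(a, j):
    return [row[j] for row in a]

def solve_bordered(M, S, E, b, g):
    """Solve M[i] x[i] + S[i] w = b[i] for all i, with sum_i E[i] x[i] = g."""
    r, nw = len(M), len(g)
    # Loop 2: solve with M_i for b_i and for every column of S_i.
    u = [solve(M[i], b[i]) for i in range(r)]
    Y = [[solve(M[i], column(S[i], j)) for j in range(nw)] for i in range(r)]
    # Loop 3: reduced system  (sum_i E_i M_i^-1 S_i) w = sum_i E_i M_i^-1 b_i - g.
    K = [[0.0] * nw for _ in range(nw)]
    rhs = [-gk for gk in g]
    for i in range(r):
        Eu = matvec(E[i], u[i])
        for k in range(nw):
            rhs[k] += Eu[k]
        for j in range(nw):
            EY = matvec(E[i], Y[i][j])
            for k in range(nw):
                K[k][j] += EY[k]
    w = solve(K, rhs)
    # Loop 4: recover each x_i from M_i x_i = b_i - S_i w.
    x = []
    for i in range(r):
        Sw = matvec(S[i], w)
        x.append(solve(M[i], [bi - si for bi, si in zip(b[i], Sw)]))
    return x, w
```

The reduced system for w has only as many unknowns as there are border variables, which is why the bulk of the work lies in the independent solves with the diagonal blocks.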

We report the running times for a single time period with
this method of solution without any vector code in Table 1.


Table 1.

  System/compiler options                        CPU sec.
  IBM 370/168                                    [illegible]
  IBM 3033                                       ~ 20
  Cyber 205, no options                          11.46
  Cyber 205, vectorization by the compiler       9.04
Architectural features combined with the sparse matrix
techniques resulted in running times three to ten times faster 
than the 110 seconds that subsequent runs required after compu- 
tation of the inverses in the first implementation of the 
World Model. The goal is now to obtain vector code for the 
Cyber 205 that will further reduce the overall running time. 

3. Code vectorization

The redesign of the World Model software for its efficient 
use on the Cyber 205 was conceived in three phases: 

I. Elementary operations over all regions 

II. The MA28 package inner loops 

III. New concepts for MA28 

Phase I consists essentially of the vectorization of all 
operations except those associated with the factoring of the 
matrices M_i and solutions of the corresponding linear systems.
Those operations correspond to steps 1.2, 2.1, and 4.1. Each
of these steps has a different structure but they all are 
loops operating on vectors of length about 200, inside another 
loop of length 16. The basic idea was to split the outer loop 
and perform simultaneously the operations on all vectors of the 
different regions, i.e., on vectors of length of about 3200. 
Cyber 205 FORTRAN commands such as scatter, gather and bit 
operations were used throughout. 

We illustrate the vectorization of step 4.1. The length
of w is about 50. S_i is a rectangular matrix of about 200 rows,
with only one nonzero entry per column. It is stored as a
vector with an accompanying integer array indicating in which
row each nonzero entry lies. The following FORTRAN statements
are part of sequential code for step 4.1.

C     Loop over the NREG regions. IBEG and IBEGB are the offsets
C     of region II in the nonzero arrays and in B, respectively.
      DO 100 II=1,NREG
         IBEG=(II-1)*NTRADE
         IBEGB=IPNTB(II)-1
         DO 50 I=1,NTRADE
C           Row of B touched by the I-th nonzero of this region.
            INDEX=KTRDBG(IBEG+I)+IBEGB
            B(INDEX)=B(INDEX)-EXPSH(I+IBEG)*W(I)
   50    CONTINUE
  100 CONTINUE

The running time for these loops was 1008 µsec. Different
vectorization options were analyzed. One of them consisted of
scattering the vectors that contain the nonzero values of
S_i and w to vectors of length of about 3200 and then performing
the triad operation. This required 9514 clock cycles, or about
190 µsec. The version adopted performs the multiplication of
the vectors containing the nonzeros of S_i and w first, a
vector operation of length about 800, scatters that vector and
performs the final subtraction in 7250 clock cycles or 145 µsec,
a gain of a factor of 7 over the sequential code.
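
The two formulations of step 4.1 can be compared in a short sketch. It is written in Python with 0-based indices rather than Cyber 205 FORTRAN, the function names are hypothetical, and `ipntb` is assumed to hold the 0-based offset of each region in `b`. The scatter stage also assumes that, within a region, no two nonzeros fall in the same row, so that a plain store does not lose a contribution.

```python
def step41_sequential(b, expsh, w, ktrdbg, ipntb, nreg, ntrade):
    """Direct transcription of the nested loops (0-based indices)."""
    b = list(b)
    for ii in range(nreg):
        ibeg = ii * ntrade
        for i in range(ntrade):
            index = ktrdbg[ibeg + i] + ipntb[ii]
            b[index] -= expsh[ibeg + i] * w[i]
    return b


def step41_multiply_then_scatter(b, expsh, w, ktrdbg, ipntb, nreg, ntrade):
    """Same result in three whole-array stages."""
    # Stage 1: one elementwise product of length nreg * ntrade.
    w_rep = w * nreg  # list repetition: w repeated once per region
    prod = [s * wi for s, wi in zip(expsh, w_rep)]
    # Stage 2: scatter the products to their rows in a work vector.
    # Assumes distinct target rows, so a store never overwrites a product.
    work = [0.0] * len(b)
    for k, p in enumerate(prod):
        work[ktrdbg[k] + ipntb[k // ntrade]] = p
    # Stage 3: one subtraction over the full length of b.
    return [bi - ti for bi, ti in zip(b, work)]
```

Multiplying before scattering keeps the product at the short length (number of regions times the length of w), and only the final subtraction runs over the full length of b.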

Similar gains have been achieved in the other portions of 
the code vectorized in phase I. Unfortunately only a small 
portion of the total running time of the World Model is spent 
in the code vectorized in phase I. Thus the overall gain was 
relatively small. 

About 30% of the total running time of the World Model is 
spent on routines of the MA28 package in which the matrices M_i
are factored (step 1.3), and solutions with many right hand 
sides computed (steps 2.2 and 4.2). At the present time we 
have completed only part of phase II, the vectorization of 
some of the inner loops in the MA28 set.

Due to the startup time in any vector operation, it is
common practice to look into the length of the vectors involved
in the operation to decide if the vectorization is really
worthwhile. In codes for sparse matrices, the vector length for an
operation is usually the number of nonzero elements in a particular
row or column, and thus varies within the code. The technique
used in this case is to test whether the vector length is above
a particular value and to branch the processing of that particular
row or column to vector or sequential code accordingly. The running
time of the code incorporating these features is 7.33 CPU seconds,
cf. Table 1.
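
The length test can be illustrated as follows. This is a hedged sketch in Python: the function names and the default threshold of 10 are assumptions made for the example (the text does not give the value used), and the "vector" path merely imitates a gather followed by a whole-array product.

```python
def dot_sequential(values, cols, x):
    """Element-by-element loop, standing in for scalar code."""
    total = 0.0
    for a, j in zip(values, cols):
        total += a * x[j]
    return total


def dot_vector(values, cols, x):
    """Gather, then one whole-array multiply and sum."""
    gathered = [x[j] for j in cols]
    return sum(a * g for a, g in zip(values, gathered))


def sparse_row_dot(values, cols, x, threshold=10):
    """Pick the code path from the number of nonzeros in the row."""
    if len(values) >= threshold:
        return dot_vector(values, cols, x), "vector"
    return dot_sequential(values, cols, x), "sequential"
```

On a real vector machine the threshold would be chosen where the start-up cost of the vector instructions is repaid by the faster streaming rate.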

Phase III, not yet implemented, consists of reconceptualizing 
the MA28 set. We will investigate the possibility of solving 
several right hand sides simultaneously, as well as other features 
like special treatment of right hand sides with few nonzero
elements.

The object of our project is to calculate the masses of the
"elementary particles". This ambitious goal apparently is not 
possible using analytic methods or known approximation methods. 
However, it is probable that the power of a modern super computer 
will make at least part of the low lying mass spectrum accessible 
through direct numerical computation. Initial attempts by 
several groups at calculating this spectrum on small lattices of 
space time points have been very promising. Using new methods and 
super computers we have made considerable progress towards 
evaluating the mass spectrum on comparatively large lattices. 
Even so, we are examining regions of space just barely large 
enough to contain the particles being examined. Only more time 
and faster machines with increased storage will allow 
calculations of systems with guaranteed minimal boundary effects. 
In what follows we outline the ideas that currently go into this 
calculation.

While a long time ago it was believed that there were only a
relatively small number of such objects (for example, protons,
neutrons, electrons, photons and so on) it is now known that there
is a virtual alphabet soup of so called elementary particles. Even
a partial listing of these by symbol is but a fraction of the
particles observed to date. Fortunately, the properties of these
particles suggest a pattern consistent with them in turn being made
out of a "small" number of more elementary objects called quarks. To
date, despite many attempts, there are no reliable reports of an
isolated quark actually being observed.

Clearly, a theory is needed that explains the rich particle
spectrum in terms of quarks and yet is compatible with quarks 
being unobservable if isolated from other matter. Further, from 
past experience with mathematical formulations, it is natural to 
insist that this description be reasonably simple and elegant. 
There is exactly one existing candidate for such a description. 
It is called Quantum Chromodynamics or Q.C.D. It is based on the 
very successful quantized description of the electromagnetic 
field interacting with electrons or Q.E.D. Q.C.D. is more 
complicated than Q.E.D. because the several species of quarks 
needed to explain the group structure of the observed particles 
as well as the confinement of single quarks allow for a very
rich mathematical structure. This structure is carried in a
partition-function-like object which is the exponential of an
action made of glue fields (designated by the symbol A) and quark
fields (designated by the symbol ψ). Here we have suppressed the
space time dependence of these fields as well as the fact that
each symbol is actually a vector with at least 12 components. The
interaction described by the action is highly non-linear but any
term contains either zero or two quark fields which somewhat
simplifies the formulation. The primary content of the
assumption that the system examined be a quantum field theory is that
at any given time every point in space has assigned to it
independent quantized degrees of freedom associated with the glue
and quark fields. It is thus very natural to describe space time
mathematically as a discrete lattice of points with separation a 
that approaches zero.

The object U(i,j), defined in terms of the glue field A(i,j) on
the link joining lattice points i and j, plays a primary role in
this theory. It has the property that U(i,j) = U†(j,i). Further
the U(i,j) are members of the group of unitary unimodular
matrices SU(3). For these fields alone we have the (effective)
partition function, an integral over all U(i,j) of the exponential
of a glue action. This action is a sum taken over all independent
square plaquettes.

We could stop with this form for the partition function and have
more work to do than currently available machine power will allow.
However, to calculate the elementary particle spectrum (except
for glueballs) we must include the quark fields in our action.
The form used because of various symmetry and gauge principles is

$$S_q = \sum_{i,j} \bar{\psi}(i)\,\bigl[1 - K B\bigr]_{ij}\,\psi(j) .$$

Here K is a numerical parameter. The matrix B depends explicitly
on the glue field A. The total action S (of course leaving out
gravity and weak interactions) is then taken to be the sum of the
glue action and this quark action.

Physics is obtained by calculating the correlation functions or
vacuum expectations of polynomials of the fields (quark and glue)
of the partition function formed from this action. The general
problem that must be confronted is the evaluation, using the
appropriate group measure, of integrals of these polynomials
weighted by the exponential of the action.

Such an integral has many variables. Since each U(i,j) is an SU(3)
matrix it is specified by 12 numbers. If we study a hypercubic
lattice with N points in each space-time direction we are dealing
with the order of N**4*12*4 numbers just associated with the glue
fields. The quark fields are characterized by (for our discussion) 12
complex numbers at each lattice point. However, this is just the
beginning. The quantities are in fact not numbers! They have the
property that ψ(i)ψ(j) = -ψ(j)ψ(i). This anticommutativity property
is essential in order that the quarks describe objects with
intrinsic half integral spin. Because the action S is quadratic
only in quark fields it is possible (using very natural
definitions) to explicitly perform the integration over quark
fields and leave the problem of evaluation of correlation
functions expressible entirely in terms of integrals over glue
fields. For example, if we examine the correlation function of
four quark fields we obtain an integral over the glue fields that
involves Det(1-KB) and elements of the inverse of 1-KB.

Note that 1-KB is a (N**4*12)**2 complex matrix. Det(1-KB) is
more or less unspeakable for any reasonable size of N. Evaluation
of the correlation function above is essential for determining 
meson masses (such as the pion) in this theory. Calculation of 
correlations of expectations of six quark fields is needed to 
evaluate properties of baryon fields (such as the proton). As a 
practical matter, numerical evaluation of six quark correlations 
is not much more difficult than four quark correlations. Clearly 
as N gets larger the problem gets more complicated. However, we 
are really only interested in the limit when N is very large 
since this corresponds to the infinite physical world. Indeed, we 
want to examine the limit where N becomes infinite and the lattice
spacing a approaches zero. Under some circumstances it can be
argued that neglecting the determinant should not make dramatic
changes in the nature of the physical answers we obtain. For this
discussion (and the particular project it is outlining) we chose
to set the determinant to unity. We are then left with a class of
integrals to evaluate which can be handled using Monte Carlo
importance sampling methods in conceivable amounts of time for
reasonably big lattices. Such systems have been studied
extensively using Vax (780) computers on lattices with 6**3*14
points. Using the C.S.U. Cyber 205 it is possible to examine far
larger systems. Indeed we are in the process of examining (on
several class 6 computers) systems with 10**3*24, 12**3*32 and
20**3*50 lattice sites. 
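
To give a feeling for the sizes involved, the counts quoted in the text can be evaluated for the lattices mentioned. The sketch below is illustrative Python with a hypothetical function name. It applies the text's figures (4 link matrices per site, 12 numbers per matrix, 12 quark components per site) to a lattice with different spatial and time extents, which is an extension of the N**4 formula assumed here.

```python
def lattice_counts(n_space, n_time):
    """Return (sites, glue numbers, order of the matrix 1-KB)."""
    sites = n_space ** 3 * n_time
    glue_numbers = sites * 4 * 12   # 4 links per site, 12 numbers per link
    quark_order = sites * 12        # 12 quark components per site
    return sites, glue_numbers, quark_order
```

For the 6**3*14 lattice this gives 3024 sites, while the 20**3*50 lattice has 400,000 sites and a quark matrix of order 4,800,000.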


After neglecting the determinant we are left with the basic
structure of an integral over the glue fields alone. We evaluate
this numerically in two steps. First, we define a probability
P(U) for each glue configuration from the exponential of the glue
action. Using Monte Carlo (Metropolis) methods we generate a sequence
of glue configurations which are distributed according to P(U). It
is important that these distributions be thermalized and
"statistically independent". By careful tuning of the way the
Monte Carlo hits are made, taking into consideration the nature of
the group measure, we can enormously speed up the decorrelation of
consecutive lattice configurations. Indeed for most cases, it is
not difficult to obtain a factor of four increase in speed of
lattice generation over conventional methods through careful
tuning. Even careful tuning of the physics of this problem does
not give reasonable run times for large lattices unless full
advantage is taken of the possibility of vectorizing the code. To
do this efficiently we use red black methods of sweeping through
lattice configurations. In addition, the memory requirements for
large lattices rapidly become excessive so we use time slicing to
control our memory allocations. We must do this since the demand
paging algorithm on the 205 does not work efficiently with the
codes which are naturally written for this problem.
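
The red black idea can be made concrete with a small sketch. It is written in Python with hypothetical function names, and it assumes periodic boundaries and an even number of points in every direction (neither is stated in the text). Sites are colored by the parity of the sum of their coordinates, so every nearest neighbor of a red site is black, and all sites of one color can be updated together as one long vector.

```python
from itertools import product


def split_red_black(shape):
    """Partition lattice sites by the parity of their coordinate sum."""
    red, black = [], []
    for site in product(*(range(n) for n in shape)):
        (red if sum(site) % 2 == 0 else black).append(site)
    return red, black


def neighbours(site, shape):
    """Nearest neighbours in every direction, with periodic wrap-around."""
    result = []
    for axis, n in enumerate(shape):
        for step in (-1, 1):
            moved = list(site)
            moved[axis] = (moved[axis] + step) % n
            result.append(tuple(moved))
    return result
```

A sweep then consists of two passes, one per color, each of which reads only values of the other color.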

After a collection of independent lattices is generated we
continue to evaluate the basic integral for the problem by
evaluating the inverse of 1-KB for the gauge configurations of
each lattice. This is somewhat simplified since this inverse need
be evaluated for only one base site, that is, a fixed row of the
matrix. However, it turns out that this inversion must be carried
out for three or four different values of the parameter K. The
method that has been most commonly used to invert the matrix
employs a Gauss Seidel method. This is slow, taking almost an
order of magnitude more time than the lattice generation. We have
other methods under study which for the particular systems
involved promise to be much faster. The Gauss Seidel method is
used in a form first applied to this problem by Weingarten. We
need to evaluate the form

$$(1 - K B)\, f = h .$$

Here h is at a fixed lattice point but can vary through the 12
values associated with the indices of the quark field at that
point. This equation is now re-written in a form that contains
a parameter which can be tuned in order to obtain the
fastest convergence in the solution of this equation by iteration
in f. In practice we code this procedure using red black ordering
and time slicing to obtain vectorization and efficient memory
management.
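
A generic relaxed Gauss Seidel iteration for a system of the form (1 - KB) f = h can be sketched as below. This is not the specific rewriting used in the project: the function name, the relaxation parameter `omega`, the stopping rule, and the use of a small dense real matrix in plain Python are all assumptions made for illustration (the actual matrix is complex, sparse and far larger).

```python
def gauss_seidel(B, h, K, omega=1.0, sweeps=200, tol=1e-12):
    """Iterate on f until (1 - K B) f = h is satisfied to within tol."""
    n = len(h)
    f = [0.0] * n
    for _ in range(sweeps):
        change = 0.0
        for i in range(n):
            # Solve row i for f[i], using the newest values of the others.
            off = sum(B[i][j] * f[j] for j in range(n) if j != i)
            diag = 1.0 - K * B[i][i]
            new = (h[i] + K * off) / diag
            # Blend with the old value; omega = 1 is plain Gauss Seidel.
            new = (1.0 - omega) * f[i] + omega * new
            change = max(change, abs(new - f[i]))
            f[i] = new
        if change < tol:
            break
    return f
```

Convergence depends on K and on the matrix B, and the iteration can fail to converge when K is too large.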

After the matrix inversion is performed and the correlations 
are evaluated through weighting over the available lattices we 
must extract physical information from the output functions. The 
easiest information obtained is the masses of the particles 
described by this formalism. It is, for example, a general
property of the theory that we are dealing with that if we look 
at correlation functions depending on only two space time points 
and then sum over all spatial directions that the resulting time 
dependent functions depend only on sums of exponentials with the 
exponent linear in the masses of the appropriate particles and 
the time separations. It is an easy matter to fit to exponentials 
and extract numerical values for the masses. However to do this 
we must tune the parameters of the theory to match the physical 
mass spectrum at some value of the mass. In effect we have a two 
parameter fit for the entire mass spectrum. It is found however 
that the Gauss Seidel method fails to converge for the physical 
value of the pion mass and hence the need to do the extrapolation 
in K mentioned earlier. After this is done, it has been found 
that on smaller lattices a fairly accurate fit can be obtained to 
the relatively light particles. We expect to find much better
fits for larger lattices where edge effects should have a
smaller effect on the calculated results.
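
How a mass follows from an exponentially decaying correlation function can be shown with a very simple estimator. The sketch below is in Python, and its function name and the unit time step are assumptions. The text does not describe the fitting procedure actually used, so this log-ratio "effective mass" is only an illustration of the principle.

```python
import math


def effective_mass(corr):
    """m(t) = ln(C(t) / C(t+1)) for C sampled at unit time steps."""
    return [math.log(corr[t] / corr[t + 1]) for t in range(len(corr) - 1)]
```

For a single exponential C(t) = A exp(-m t) every entry equals m. When several exponentials contribute, the estimate varies with t and a fit to a sum of exponentials is needed.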
ABSTRACT

The full multigrid (FMG) method is applied to the two
dimensional Poisson equation with Dirichlet boundary
conditions. This has been chosen as a relatively simple
test case for examining the efficiency of fully vectorizing
the multigrid method. Data structure and programming
considerations and techniques are discussed, accompanied by
performance details.


April 1983 


1. INTRODUCTION 

The multigrid (MG) method has been shown to be a very efficient solver
for discretized PDE boundary-value problems on serial (scalar) computers.
However, it was not clear how well the MG approach can be adapted to
execute effectively and efficiently on a vector processor, such as the CDC
CYBER 205, where considerations other than operations-count may play an
important role. The purpose of this paper is to report our experience in
implementing an MG code on the CDC CYBER 205. More specifically, the
test-case considered is the two-dimensional Poisson equation with Dirichlet
boundary conditions. It will be assumed here that the reader has some
familiarity with the philosophy, the motivation and the basic computational
processes of MG as a fast solver. These processes are described in detail
in a number of papers in these proceedings and [1] and [2] and references
therein. The algorithm described in this paper is basically the same as
the one given in the appendix of [3], whose description is detailed in
sections 8.1 and 6.4 of [3]. Therefore, no full description of the MG
algorithm is given here, but the relevant details are included in the
appropriate context. The main emphasis of this paper is the vectorization
of these processes. Thus, we will not assume an in-depth knowledge or
experience in applying MG solvers on a vector-processor type of a computer
system.


* Presented at the International Multigrid Conference, Copper Mountain,
Colorado, April 6-8, 1983.

Consequently, Section 2 contains a brief summary of architectural and
conceptual features of a vector processor (specific to the CDC CYBER 205),
which are relevant to this application, as well as software tools available
for a tight correlation between the hardware and the computational process.
Sections 3, 4 and 5 are devoted to the description of the techniques used
for vectorizing the procedures for the relaxation, the residual transfer
calculation and the interpolation, respectively. The total full multigrid
(FMG) process and various parameters and constraints are described in
Section 6 interleaved with convergence and timings (performance) details.
Finally, Section 7 contains some concluding remarks and comments regarding
future plans.


2. VECTOR PROCESSING 

The most significant difference between a traditional, serial computer
and a vector processor is the ability of the latter to produce a whole
array ("vector") of results upon issuing a single hardware instruction.
The input to such a vector-instruction may be one or two vectors, one or
two elements ("scalars"), or a combination of the above. The instructions
fall into two main categories: those that perform floating-point arithmetic
(including square root, sum, dot-product, etc., as well as the basic
operations), and those which may be collectively called "data-motion"
instructions. These may be used, for example, to "gather" elements from
one array into another using an arbitrary "index-list"; to "compress" or
"expand" an array; to "merge" two arrays into one (with arbitrary
"interleaving" patterns), etc.

The need for vector data-motion instructions becomes apparent when one
considers the definition of a vector on a CDC CYBER 205. A vector is a set
(array) of elements occupying consecutive locations in memory. It means,
by the way, that a vector may be represented in FORTRAN by a
multi-dimensional array; i.e., a two- or three-dimensional array may be used in
computations as a single vector. The reason for this vector definition is
that when performing vector operations on the CDC CYBER 205 the input
elements are streamed directly from memory to the vector pipes and the
output is streamed directly back into memory without any intermediate
registers.

The timing formula for completing a vector instruction contains two
components. One is fixed, i.e., independent of the number of elements to
be computed, and is called "start-up" time. In fact, it amounts to
start-up and shut-down; it involves fetching the pointers to the input and
output streams, aligning the arrays so as to eliminate bank conflicts and
getting the first pair of operands to the functional unit (the pipe-line)
and the last one back to memory. Typical time for the "start-up" component
is 1 microsecond, or about 50 cycles (clock periods). The other component
of the timing formula is the "stream-time" which is proportional to the
number of elements in the vector. The result rate for a 2-pipe CDC CYBER
205 for an add or multiply is 2 results per cycle. It is apparent now that
in order to offset the "wasted" cycles of start-up times it is beneficial
to work with longer vectors. The system is better utilized if a single
operation is performed on a long vector, rather than several operations to
compute the same number of results. Given a vector length, N, one can
evaluate the efficiency of the computation as the ratio between the number
of cycles used to compute results and the total number of cycles the
instruction has taken; i.e., (N/2)/(N/2 + 50). The maximum vector length
the CDC CYBER 205 hardware allows is 65,535 elements. The start-up time
becomes quite negligible long before that.
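
The efficiency formula is easy to tabulate. The sketch below is in Python with a hypothetical function name. The defaults of 50 start-up cycles and 2 results per cycle are the figures quoted in the text for a 2-pipe machine.

```python
def vector_efficiency(n, startup_cycles=50, results_per_cycle=2):
    """Fraction of an instruction's cycles spent producing results."""
    stream = n / results_per_cycle
    return stream / (stream + startup_cycles)
```

With these figures a vector of 100 elements runs at 50 percent efficiency, one of 1,000 elements at about 91 percent, and the maximum length of 65,535 at better than 99.8 percent.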

The vector "arguments" for vector instructions are inserted through a
construct called Descriptor. It is a quantity occupying 64 bits which
fully describes a vector through two integer values: one is the virtual
address of the starting location of the vector, the other is the number of
elements, or the length, of the vector. An element may be a bit, a byte, a
half-word (32-bits) or a word (64-bits) depending on the instruction and the
argument within the instruction. The CDC CYBER 205 FORTRAN provides the
ability to declare variables of "type" Descriptor and Bit, as well as
extensions for assigning Descriptors to arrays and syntax for coding vector
instructions without such an explicit association. Bit arrays occupy
exactly one bit per element, since the CDC CYBER 205 is bit-addressable.

Bit vectors are used for creating a "mapping" between an array containing
numerical values and a subset of it. A Bit vector may be used to control a
vector floating-point operation (hence the term "control-vector" which is
commonly used for a Bit vector) as follows. Take, for example, an add
operation. All the elements of the two input arrays are added up, but only
those result elements where the corresponding element of the control-vector
is 1 will be stored into the results vector. The other elements will not
be modified. Alternatively, one may specify storing on zeros in the
control-vector, and discarding results corresponding to a 1.
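
The effect of a control-vector on an add can be mimicked in a few lines. This is an illustrative Python model of the behaviour described above, not CYBER 205 FORTRAN syntax. The function name and the `store_on` argument are hypothetical.

```python
def controlled_add(a, b, result, control, store_on=1):
    """Add a and b elementwise, storing only where control == store_on."""
    out = list(result)
    for k, (x, y, bit) in enumerate(zip(a, b, control)):
        total = x + y           # every sum is computed ...
        if bit == store_on:     # ... but only the selected ones are stored
            out[k] = total
    return out
```

Elements whose control bit does not match keep whatever value the result vector already held.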

Another common use of bit vectors is associated with some of the
data-motion instructions. Two examples will be given here. The "compress"
instruction is used to create a vector which is a subset of another vector.
This operation has two input descriptors: one points to a numeric vector,
the other to a bit vector. Whenever a 1 is encountered in the bit-vector
the corresponding numeric element is moved to the next location of the
output vector, i.e., the input array is "compressed" (the reverse process
may be accomplished with an "expand" instruction). A single bit-vector may
also be used to "merge" two numeric vectors into one. The bit-vector is
scanned and when a 1 is encountered the next element of the first input
vector is put into the next location of the output vector; when a zero is
found in the bit-vector the next element of the second input vector is
moved into the next location of the output vector. The timing for both
these instructions is dictated by the total length of the bit-vector. The
result-rate is the same as that of vector arithmetic, i.e., on a two-pipe
CYBER 205 it is two elements per cycle (whether they are moved or not). It
will be noted here that there are vector instructions for creating repeated
bit patterns at a rate of 16 bits per cycle.
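
The semantics of "compress" and "merge" as described above can be modelled directly. The Python functions below are illustrative stand-ins with hypothetical names, not the hardware instructions or their FORTRAN spellings.

```python
def compress(vec, bits):
    """Keep the elements of vec whose bit is 1, packed together."""
    return [v for v, bit in zip(vec, bits) if bit]


def merge(first, second, bits):
    """Take the next element of first on a 1, of second on a 0."""
    it1, it2 = iter(first), iter(second)
    return [next(it1) if bit else next(it2) for bit in bits]
```

Compressing a vector with an alternating bit pattern and with its complement splits it into two halves, and merging the halves under the same pattern restores the original order.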

Before concluding this section let us briefly mention the existence of 
an "average" instruction, which computes an average of two vectors, or 
adjacent means of a single vector, at the rate of a single floating-point 
operation. One can also "link", for example, an add and a multiply opera- 
tion, provided at least one of the three inputs is a "scalar", and perform 
the two operations as if it were only one. All the instructions mentioned 
above are directly available through Fortran in-line function calls.


3. RELAXATION

Now we are ready to examine the ways in which to utilize the tools and
the vector processing concepts discussed in the previous section for
vectorizing the Multigrid application. The success of such an exercise
hinges, to a large extent, upon the efficiency with which the relaxation
process may be accomplished.

Discretization of the two-dimensional Poisson equation is achieved
via the 5-point differencing scheme. Thus, assuming geometric
interpretation of the indices for the moment, the set of the simultaneous
equations to be solved may be written as

$$u_{i-1,j} + u_{i,j-1} + u_{i+1,j} + u_{i,j+1} - 4\,u_{i,j} = h^2 F_{i,j}$$

where u is the unknown function, h is the interval between two grid points
(in either direction) and F is the right-hand side function. i varies from
2 to N_1 - 1 and j from 2 to N_2 - 1, where N_1 and N_2 are the number of
grid points along the two directions.

One may want to consider the usual (lexicographic) Gauss-Seidel
relaxation procedure. This, however, will be in conflict with vectorization, as
may be easily deduced. The Gauss-Seidel relaxation is characterized by the
use of updated values as soon as they become available. Vectorization means
processing many such values in parallel, i.e., not waiting for the previous
element to be updated. The obvious alternative is the red-black or
checker-board ordering, where all the four neighbors of each point belong
to the other "color". The convention used here is that the "color" of the
grid points at the corners of the rectangle is red. The grid may
accordingly be divided into two vectors and the relaxation performed in two
stages: first, the values at red points are updated using "old" values,
then the values at black points are updated using the "new" red values.
Throughout the code the two vectors of the unknown function (and of the RHS
function) are stored consecutively following each other, where inside each
vector the values are stored column-wise as shown in Figure 1. This
storage applies, of course, to all the grids used.

Figure 1. Mapping of the Lexicographic into the "Red-Black" Ordering. The
dotted line indicates the separations of the grid points into two vectors.

The reader will notice that the vectors thus created are not confined
to one column, but extend over the entire grid. It was done in order to
achieve longer vectors in line with the desire expressed in Section 2.
This, however, introduces the hazard of overwriting values residing on the
boundary of the grid. To avoid this a bit control-vector was created for
each grid, in a set-up routine, which contains zeros where boundary points
exist and ones for interior points. We use this "boundary control vector"
to assure storing new values only into the interior of the grid.

The computation requires the sum of the 4 neighbors for each grid
point. One can easily verify that, using vector add operations, this can be
done with two operations only: one to add a vector into itself, with some
offset (e.g., start with elements 2 and 5 in Figure 1) and the second to
add the resultant vector into itself (with some other appropriate offset).
The remaining calculation involves subtracting the result from the RHS
values and multiplying by a constant (being -0.25), which is accomplished as a
linked-triad operation; the result is then stored into place under the
control of the boundary bit-vector. Thus, each of the two stages (two
"colors") requires three floating-point operations using vectors of length,
approximately, (N_1 * N_2)/2 elements. In fact, some more savings
in the computations occur in the first relaxation sweep after moving to a
coarser grid, since the sum of the "neighbors" need not be computed for the
first "color," being known to be zero. This is because we are beginning to
compute a correction-function whose first approximation is zero. The
vector-operations count for this relaxation sweep is thus reduced from 6 to
4. Also, when transferring a solution-function (not "correction") to a
finer grid, as part of the FMG process, an interpolation can be used which
will save the relaxation on the first "color" (see Sec. 5).
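
The arithmetic of one red-black sweep can be written out pointwise. The sketch below is plain Python on a two-dimensional list, with hypothetical function names. It loops over the points one by one instead of using the long-vector storage and control vectors described in the text, so it shows what is computed, not how the CYBER 205 code computes it. `rhs` is assumed to hold the values h**2 * F.

```python
def relax_color(u, rhs, color):
    """Update interior points whose (i + j) parity equals color (0 = red)."""
    n1, n2 = len(u), len(u[0])
    for i in range(1, n1 - 1):          # the boundary is never written
        for j in range(1, n2 - 1):
            if (i + j) % 2 == color:
                nb = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                u[i][j] = -0.25 * (rhs[i][j] - nb)


def red_black_sweep(u, rhs):
    relax_color(u, rhs, 0)   # red stage: reads only (old) black values
    relax_color(u, rhs, 1)   # black stage: reads the new red values
```

Because every neighbor of a point has the other color, the updates within one stage are independent of each other, which is what allows each stage to be a single long vector operation.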

In conclusion, the relaxation process can obviously be done extremely 
fast on the CYBER 205. Timing details will be given in Section 6. 


4. FINE TO COARSE RESIDUAL TRANSFER 

Residuals have to be computed at those fine-grid points which also
belong to the coarser grid. These residuals are directly transferred to
the corresponding coarse-grid points weighted by 1/2 ("half injection"; the
factor of 1/2 is motivated by the fact that the fine-grid residual is zero
at black fine-grid points, hence the other residuals should be multiplied
by 1/2 to represent the correct average). See Figure 2.

The computation involves four floating-point operations (two of them
are linked triads) for evaluating the residuals of the red points on the
finer grid and multiplying them by 1/2. This, however, does not conclude
the procedure. At this stage we need to apply the "compress" operation
three times as follows: using a pre-defined bit-vector we extract the
residual values corresponding to coarse-grid points, i.e., belonging to
odd-numbered columns of the red section of the finer grid. (Note that we
have thrown away half the calculated residuals. This procedure is both
simpler and a little faster than having to perform all the compress
operations needed for computing only the required residuals.) Now, as is
evident from Figure 2, we have all the desired values for the coarser grid
stored in lexicographic order. To separate them into "red" and "black"
sections the "compress" instruction is applied twice (once for each color)
using a pre-defined "picket fence" bit-vector. The procedure as described
here produces optimum performance even though some redundant operations are
performed. The alternatives are to perform different (more "costly") data
motions or to operate on much shorter vectors. Finally, another vector
operation is executed to zero out the unknown function of the coarser grid
in preparation for evaluating the correction function. In total the
procedure requires 8 vector "start-ups" associated with 5 operations of
approximate length (N_1 * N_2)/2, and 3 operations of length (N_1 * N_2)/4, where
N_1 and N_2 are dimensions of the finer grid.
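
The values produced by half injection can be sketched pointwise. The Python function below has a hypothetical name and assumes a fine grid with odd dimensions, coarse points at every second fine point in each direction, and `rhs` holding h**2 * F. It returns half the fine-grid residual at each interior coarse point and leaves out any further rescaling for the coarse-grid spacing, as well as the compress steps that reorder the result.

```python
def half_injection(u, rhs):
    """Half the fine-grid residual at the points shared with the coarse grid."""
    n1, n2 = len(u), len(u[0])
    m1, m2 = (n1 + 1) // 2, (n2 + 1) // 2
    coarse = [[0.0] * m2 for _ in range(m1)]
    for ci in range(1, m1 - 1):
        for cj in range(1, m2 - 1):
            i, j = 2 * ci, 2 * cj       # the coinciding fine-grid point
            lap = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                   - 4.0 * u[i][j])
            coarse[ci][cj] = 0.5 * (rhs[i][j] - lap)
    return coarse
```

With 0-based indices the shared points have even i and j, so they are all red in the convention that the corners are red.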



Figure 2. Transfer to a Coarser Grid: The residual calculation. Each 
"Box" contains the fine grid points involved in the computation for the 
corresponding coarse grid point. 


5. INTERPOLATION 

Interpolation, in the context of this paper, is the process by which we
transfer from a given grid to a finer one. Two types of interpolation are
employed here. Type I interpolation is used when a correction is
interpolated from the coarser grid and added to the finer grid. Type II
interpolation is used to compute a first approximation on the finer grid,
based on existing values on the coarser grid. The use of the red-black
ordering, combined with the fact that a relaxation always follows an
interpolation, implies that only one color of the finer-grid points needs
to be interpolated (the other color will be computed by a relaxation pass
on that color).

Type I interpolation is bilinear, employing points as shown in Figure 3.
Only interior black points on the finer grid need to be evaluated. Due to
the required averaging of the coarse grid values it is convenient to first
merge the red and black points of this grid using the "picket-fence" bit
vector to produce the lexicographic ordering. Next, two averages are
computed. The average over the coarse grid, where the two input vectors
are offset by a column, will produce the quantities to be added into black
points on even-numbered columns on the fine grid. A second average, where
the offset between the two vectors is one element, is executed for fine
grid black points corresponding to odd-numbered columns. This last
operation produces redundant values (at the end of each coarse grid column)
which are thrown away using the "compress" operation with an appropriate
pre-defined bit vector. The two resultant coarse grid "average-vectors"
are then interleaved, using a "merge" instruction, under the control of the
bit vector where the "1's" and "0's" correspond to odd and even columns,
respectively. Finally, the merged values are added to "black" points of
the finer grid under the control of the "boundary" bit-vector which inhibits
storing values into the boundary of the grid. The whole procedure amounts
to 3 floating-point operations, 2 "merges" and 1 "compress." The 6 vector
operations may also be divided into 4 operations of length (N1 * N2)/4
and 2 operations of length (N1 * N2)/2, approximately. (N1 and N2
are the dimensions of the finer grid.)


Figure 3. Type I Interpolation. It shows where averages of coarse grid
values are added into "Black" points on the fine grid.
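The two offset averages and the compress that discards the redundant end-of-column values can be sketched as follows. This is an illustrative Python analogue under stated assumptions (a 3 by 3 coarse grid stored column by column, helper names `offset_average` and `compress` invented here); the final merge and the masked add onto the fine grid are left out.

```python
def compress(values, bits):
    """Keep values[k] where bits[k] is 1."""
    return [v for v, b in zip(values, bits) if b]

def offset_average(v, offset):
    """Average v with a copy of itself shifted by `offset` elements."""
    return [(a + b) / 2 for a, b in zip(v, v[offset:])]

n = 3  # points per column of a hypothetical 3 x 3 coarse grid
coarse = [0.0, 1.0, 2.0,      # column 1
          10.0, 11.0, 12.0,   # column 2
          20.0, 21.0, 22.0]   # column 3

# Offset of one column: averages of neighbours in adjacent columns.
avg_col = offset_average(coarse, n)

# Offset of one element: averages of neighbours within a column, plus
# redundant values pairing the end of one column with the start of the next.
avg_elem_raw = offset_average(coarse, 1)
keep = [0 if k % n == n - 1 else 1 for k in range(len(avg_elem_raw))]
avg_elem = compress(avg_elem_raw, keep)
```

Computing the redundant values and then discarding them keeps both averages as single long vector operations, which is the trade the text describes.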


Type II interpolation is a 4th order one, described, for example, in
section 6.4 of [3]. It produces new red unknown-function values on a finer
grid using rotated difference operators. The values at the black points
are produced by half a relaxation sweep, i.e., a relaxation pass over the
fine-grid black points. (This pass may be regarded as part of the
interpolation process. In the timing tables below, however, the time spent
in this pass is counted as relaxation time.) The process is described
pictorially in Figure 4. All the interior coarse grid values are moved to
occupy the corresponding fine-grid points. The relaxation operator is
applied to these values in order to compute interior red points of the
even-numbered columns on the fine grid. The only difference between the
relaxation here and the one described in Section 3 is that the operator is
the "rotated" 5-point Laplacian and the interval between each point and its
neighbors is changed from h to sqrt(2)*h. The RHS function values required
for this relaxation are available from the fine grid RHS array (a
"compress" operation is performed to retrieve even-numbered column values).
The whole procedure, thus, requires 2 "merges" (one for merging red-black
values of the coarse grid, the other for merging the "transferred" and
"relaxed" values of the red fine grid points); 3 floating-point operations
for the relaxation; 2 "compress" operations (one for throwing away
redundant, incorrect averages and one for collecting RHS values); and,
finally, one vector-move operation under the control of the boundary
bit-vector for storing the new red fine grid values into place. Five out of
the 8 vector operations have length of about (N1 * N2)/4, the other 3 are
associated with a length of (N1 * N2)/2, N1 and N2 being the dimensions of
the finer grid.
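To make the effect of the rotated operator concrete, the sketch below relaxes a single point from its four diagonal neighbors. It is an illustration only: it assumes the model equation is written as Laplacian(u) = f (the sign convention is an assumption), and the function names are invented here.

```python
def rotated_relax_point(u_nw, u_ne, u_sw, u_se, f, h):
    """New value at a point from its four diagonal neighbours.

    Assumes Laplacian(u) = f. The diagonal neighbours are sqrt(2)*h away,
    so the h**2 of the usual 5-point formula becomes 2*h**2.
    """
    return (u_nw + u_ne + u_sw + u_se - 2.0 * h * h * f) / 4.0

def u_exact(x, y):
    return x * x + y * y   # its Laplacian is 4 everywhere

h, x, y = 0.25, 0.5, 0.75
new_value = rotated_relax_point(
    u_exact(x - h, y + h), u_exact(x + h, y + h),
    u_exact(x - h, y - h), u_exact(x + h, y - h),
    f=4.0, h=h,
)
```

For a quadratic such as this one the rotated stencil is exact, so the relaxed value reproduces u_exact(x, y).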





Figure 4. Type II Interpolation. Coarse grid values are transferred to
odd-numbered columns on the fine grid. These values are used to compute,
via the relaxation operator, the even-numbered column values.


6. PERFORMANCE AND CONVERGENCE 

The basic computational procedures, studied in the previous three
sections, can now be linked together to form the FMG process. Figure 5 is
a schematic description of the sequence of events which leads to an
approximate solution of the difference equations. The finest grid (where a
solution is sought) is assigned the highest level number. The example
depicted in Figure 5 describes an FMG with 5 levels where the process
starts at level number 2. This may not be necessary, as will be argued
below, and one may visualize the FMG starting at a higher level simply by
deleting the left-hand side of the figure. This starting level is a
parameter controlled by the user. The FMG shown in Figure 5 is composed of
what is known as "V" cycles. In each "V" cycle one performs relaxation,
residual calculation, relaxation, and so on until reaching the coarsest
grid; then a sequence of interpolation-relaxation is executed. The transfer
from one "V" cycle to the next is achieved via Type II interpolation. More
specifically, the FMG we implemented may be characterized as
FMG(M, L, S1, S2, S3, S4), where M is the number of levels and L is the
starting level; S1 and S2 indicate the number of relaxations before moving
to a coarser grid and before moving to a finer grid, respectively. S3 and
S4 have the same meaning and apply to the last "V" cycle only. All these
parameters are provided by the user. The user may also specify the size of
the coarsest grid to be used. It must have an even number of intervals in
each direction. (In our experiments the coarsest grid had 3 by 3 points,
i.e., 2 by 2 intervals.) The user also specifies the mesh size h (assumed
to be the same in both directions) on the finest grid.



Figure 5. The Full Multigrid (FMG) Process: FMG(5, 2, S1, S2, S3, S4).
The circles indicate the number of relaxations performed at a given level.
A downwards arrow signifies residual calculation between relaxations; an
upwards arrow implies interpolation. (When a level is encountered for the
first time the interpolation is of Type II, indicated by a double line
above; otherwise it is of Type I.) When level 1 contains only one interior
point only one relaxation sweep is performed thereon, regardless of the
values given to S1 and S3.
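The schedule just described can be walked through in a few lines to count how often each procedure runs at each level. The sketch is a simplified reading of the text, not the authors' driver: it assumes the same (S1, S2) for every "V" cycle, one sweep on level 1, and residual and interpolation counts booked against the finer grid involved; the function name `fmg_counts` is invented here.

```python
from collections import Counter

def fmg_counts(M, L, s1, s2):
    """Count operations per level for an FMG built from V(s1, s2) cycles."""
    relax, residual, interp1, interp2 = Counter(), Counter(), Counter(), Counter()
    for top in range(L, M + 1):            # one "V" cycle per top level
        if top > L:
            interp2[top] += 1              # Type II: first visit to this level
        for level in range(top, 1, -1):    # downward leg
            relax[level] += s1
            residual[level] += 1           # booked at the finer grid involved
        relax[1] += 1                      # single sweep on the coarsest grid
        for level in range(2, top + 1):    # upward leg
            interp1[level] += 1            # Type I correction
            relax[level] += s2
    return relax, residual, interp1, interp2

relax, residual, interp1, interp2 = fmg_counts(M=7, L=2, s1=2, s2=1)
```

For M = 7, L = 2 and V(2,1) cycles these counts agree with the call counts listed in Table 3 for the 129 by 129 problem (for example 18 relaxations on level 2 and 3 on level 7).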


The process described above is deterministic, in the sense that the
user defines the steps to be taken, based on prior knowledge of the
characteristics and smoothness of the function to be solved. It is also
known that if L=2 the FMG guarantees a solution error smaller than the
truncation error (introduced by the differencing scheme), in the L2 norm,
for example. We have allowed, however, as a user option, the evaluation of
the L2 and L∞ norms of the residual at various points. Testing was
done for problems which have solutions of the form:

    C * cos(k(x + 2y))

with and without the addition of a 6th degree polynomial which vanishes on
the boundary. In all these cases the FMG process with L=2 indeed produced
a solution with an algebraic error (error in solving the difference
equations) much smaller than the truncation error, in the L1, L2 and L∞
norms.

Only "V(2,1)" cycles were used for the results and timings to be quoted
here. This turns out to be the optimum combination for the Poisson
equation. More relaxations at each stage do not improve the final result
enough to justify the additional work; fewer relaxations may cause
deterioration in the accuracy. (If full weighting were used instead of half
injection, the optimal cycle would be "V(1,1)". This would, however, be
less efficient than the present procedure since full weighting is
substantially more costly than a relaxation sweep.) In the performance
details which follow, we will give results for various values of L since,
in many cases, in particular when a reasonable initial guess is available,
high values of L, even L=M, may provide sufficient accuracy. This is, in
particular, the situation when the Poisson solver is used within some
external iterative process, or at each time step of an evolution problem.

Before discussing the timings we should briefly mention some set-up
procedures. A routine is provided for re-ordering the initial array (from
lexicographic to red-black) if it is not so structured yet. This is done
through two "picket-fence compress" operations and amounts to 0.185 msecs
for a 65 by 65 grid, for example. Putting the solution back into
lexicographic order is done with a single "merge" instruction and takes
half as long. Next, there is a routine which defines various pointers and
lengths for all the grids used, as well as the bit-vectors discussed
earlier. For many applications, where the solver is used many times with
the same grid definition, this will be done only once. It will not,
therefore, be included in the total times quoted below (it takes 0.29 msecs
for a 65 by 65 grid with 6 levels). The last set-up routine is included in
the timing information. This routine defines the boundary values and the
RHS for all the levels between L and M-1. It also sets the initial guess on
the level L grid.
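The re-ordering and its inverse can be mimicked in software as below. This is only an analogue of the vector "compress" and "merge" instructions; the 5 by 5 grid, the column-by-column storage and the convention that the first element is "red" are assumptions of the sketch.

```python
def compress(values, bits):
    """Keep values[k] where bits[k] is 1."""
    return [v for v, b in zip(values, bits) if b]

def merge(a, b, bits):
    """Interleave a and b: take the next element of a where the bit is 1, else of b."""
    ia, ib = iter(a), iter(b)
    return [next(ia) if bit else next(ib) for bit in bits]

lexicographic = list(range(25))              # hypothetical 5 x 5 grid, column by column
picket = [(k + 1) % 2 for k in range(25)]    # "picket fence": 1, 0, 1, 0, ...

# Two compress operations split the array into its two colors.
red = compress(lexicographic, picket)
black = compress(lexicographic, [1 - b for b in picket])

# A single merge restores the lexicographic order.
restored = merge(red, black, picket)
```

Because each column has an odd number of points, the color alternates with every element of the column-ordered array, so one alternating bit-vector serves the whole grid.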

The code was run with grid sizes of 33 by 33, 65 by 65 and 129 by 129
(M = 5, 6 and 7, respectively) with L=2,...,M. Total execution times are
given in Table 1. It shows, for example, that a 65 by 65 grid may be
solved in as little as 1 msec and, at most, in 2 msecs. By examining the
processing time per grid-point one can see the effect of vector-instruction
start-up times or the dependence of the performance upon vector lengths.

On a serial processor the time per element would have been, approximately,
a constant across each line in Table 1. We observe, however, that the
processing of the 129 by 129 grid is roughly twice as efficient as that of
the 33 by 33 grid. This is due to the fact that even though the number of
vector "start-ups" remains nearly the same (across a given line), the
number of elements solved for has increased by a factor of 16. Hence, more
time is spent doing useful arithmetic in the vector pipelines.

TABLE 1. Execution times for various parameters of the FMG. The entries
on the left are total times in milliseconds. The entries enclosed in
parentheses are the execution times in microseconds per grid-point (only
interior points are taken into account).

          |    33 by 33     |    65 by 65     |   129 by 129
  M-L+1   |     (M = 5)     |     (M = 6)     |     (M = 7)
  --------+-----------------+-----------------+-----------------
     1    |  0.360  (0.37)  |  1.006  (0.25)  |  3.293  (0.20)
     2    |  0.604  (0.63)  |  1.552  (0.39)  |  4.910  (0.30)
     3    |  0.729  (0.76)  |  1.810  (0.46)  |  5.440  (0.34)
     4    |  0.801  (0.83)  |  1.947  (0.49)  |  5.687  (0.35)
     5    |                 |  2.009  (0.51)  |  5.807  (0.36)
     6    |                 |                 |  5.875  (0.36)
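The parenthesized entries of Table 1 follow from the totals by dividing by the number of interior points. A short check, using the M-L+1 = 1 row as input (the function name is ours, for illustration):

```python
# Total times in milliseconds from the M-L+1 = 1 row of Table 1.
total_ms = {33: 0.360, 65: 1.006, 129: 3.293}

def microseconds_per_interior_point(n, ms):
    """Time per interior point of an n x n grid, in microseconds."""
    interior = (n - 2) ** 2
    return 1000.0 * ms / interior

per_point = {n: round(microseconds_per_interior_point(n, ms), 2)
             for n, ms in total_ms.items()}
```

The drop from 0.37 to 0.20 microseconds per point between the 33 by 33 and 129 by 129 grids is the factor of roughly two in efficiency mentioned in the text.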


Tables 2 and 3 present a more detailed analysis of timings for a single
example, namely for solving a 129 by 129 grid with 7 levels and starting at
level 2. The entries in Table 2 show timings in msecs, by level and by
procedure. One notices that the total time spent performing relaxations is
less than 50% of the total time. This is to be compared against the 30-90%
of total time used for relaxations on a serial processor. This is, of
course, due to the fact that the vectorized relaxation is extremely
efficient and does not involve any data-motion operations. The
interpolation and the residual calculations, though fully vectorized,
involve some data-motion operations and, therefore, consume a relatively
higher proportion of the execution time than they would on a "scalar"
computer. Another observation worth mentioning is that the contributions to
all the procedures arising from levels 2 to 4 are roughly the same, even
though the amount of work differs by a factor of 4 between levels. This is
a consequence of the relatively short vectors which characterize the
coarser grids. It also explains the larger weight the coarse grids have in
the vectorized code compared to that of the serial process.


TABLE 2. Execution times in milliseconds for solving a 129 by 129 grid
with starting level 2. Breakdown by procedure and by level. For the
residual calculation and the interpolations the entry in the table
corresponds to the finer grid involved.

  Level          |      Grid      | Relaxation |  Residual   |   Interpolation    | Total
                 | Initialization |            | Calculation | Type I  | Type II  |
  ---------------+----------------+------------+-------------+---------+----------+-------
  1 (3x3)        |                |   0.010    |             |         |          | 0.010
  2 (5x5)        |     0.011      |   0.179    |    0.014    |  0.011  |          | 0.215
  3 (9x9)        |     0.015      |   0.160    |    0.060    |  0.049  |  0.024   | 0.308
  4 (17x17)      |     0.034      |   0.189    |    0.068    |  0.053  |  0.028   | 0.372
  5 (33x33)      |     0.106      |   0.320    |    0.117    |  0.095  |  0.053   | 0.691
  6 (65x65)      |     0.388      |   0.690    |    0.261    |  0.194  |  0.141   | 1.674
  7 (129x129)    |                |   1.257    |    0.497    |  0.357  |  0.494   | 2.605
  ---------------+----------------+------------+-------------+---------+----------+-------
  TOTAL          |     0.554      |   2.805    |    1.017    |  0.759  |  0.740   | 5.875

In Table 3 we have measured the time in microseconds for each time a
procedure is executed for a given level, accompanied by the number of times
the procedure is performed. It should be noted here that when level 1 is
involved in any of the procedures a scalar code was used, since it has only
one interior point. Again, the effect of vector lengths is such that the
level 3 relaxation is comparable to that of level 2, for example. Only
when we get to the finest grids do we observe timing ratios which
correspond to the ratios of the number of elements processed. The reader
should be reminded that the average time of the relaxation procedure is not
fully accurate, since some relaxations are not quite "complete," as was
explained in Section 3 (i.e., after Type II interpolation and after
residual calculation). The residual calculation takes longer than the
relaxation (in contrast to the scalar case), which is understandable from
the discussion in Sections 3 and 4.

TABLE 3. Procedure-call counts and average times in microseconds per
call. Breakdown by levels for the 129 by 129 problem with starting level 2.
Note: Some of the relaxations are not "complete." (See Section 3)

                 |               |               |         Interpolation
                 |  Relaxation   |   Residual    |    Type I     |    Type II
  Level          |  No.    Time  |  No.    Time  |  No.    Time  |  No.    Time
  ---------------+---------------+---------------+---------------+--------------
  1 (3x3)        |   6      1.7  |               |               |
  2 (5x5)        |  18      9.9  |   6      2.3  |   6      1.8  |
  3 (9x9)        |  15     10.7  |   5     12.0  |   5      9.8  |   1     24.0
  4 (17x17)      |  12     15.8  |   4     17.0  |   4     13.3  |   1     28.0
  5 (33x33)      |   9     35.6  |   3     39.0  |   3     31.7  |   1     53.0
  6 (65x65)      |   6    115.0  |   2    130.5  |   2     97.0  |   1    141.0
  7 (129x129)    |   3    419.0  |   1    497.0  |   1    357.0  |   1    494.0
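Tables 2 and 3 can be cross-checked against each other: multiplying each call count by its average time should recover the per-level totals. The sketch below does this for the relaxation column only, using the figures quoted in the tables.

```python
# (number of calls, average microseconds per call) for relaxation, from Table 3.
relaxation = {1: (6, 1.7), 2: (18, 9.9), 3: (15, 10.7), 4: (12, 15.8),
              5: (9, 35.6), 6: (6, 115.0), 7: (3, 419.0)}

# Per-level relaxation time in milliseconds, as tabulated in Table 2.
per_level_ms = {level: calls * usec / 1000.0
                for level, (calls, usec) in relaxation.items()}

total_relax_ms = sum(per_level_ms.values())

# Share of the 5.875 ms total (Table 2) spent in relaxation.
relaxation_share = total_relax_ms / 5.875
```

The sum comes to about 2.8 msecs, in line with the 2.805 msecs of Table 2, and the resulting share of just under one half is the "less than 50%" noted in the text.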



To conclude the performance discussion we will mention that the vectorized
code executes about 15 times faster than the scalar version on the CDC
CYBER 205, and roughly 500 times faster than on the CDC CYBER 720.

The lesson from what was said above is that relaxations are relatively
"cheap" in terms of execution times, and computations on the coarser grids
are relatively "costly" (compared with the ratios found on scalar
processors).


7. CONCLUDING REMARKS 

One important lesson, known very well to those involved in vector
processing, is that it demands careful data structuring and analysis of the
"mapping" between the data and the operations to be performed, if the
vector capabilities of the processor are to be efficiently utilized. We
have also demonstrated that the traditional operations count as a measure
of processing time is not sufficient. On a vector processor one has to
take into account the number of vector operations (or the lengths of the
vectors) and the data-motion operations (which occur on a serial processor,
too, but are often ignored when algorithms are evaluated). The result of
the above is that one may have to re-examine the various parameters of the
algorithm when migrating the Multigrid application from a serial to a
vector processor. This aspect requires further investigation.

We feel that the experiment with the model case studied in this paper
was successful and the performance achieved very pleasing. It certainly
warrants continuation work. Some obvious areas we intend to engage in are
the following: extending the application to three-dimensional Poisson
equations; coding a similar application to cater for the more general
diffusion equation; and implementing "full-weighting" residual calculation
and cubic interpolation. In addition one may, of course, generalize this
work in many directions. More general boundary conditions (Neumann, etc.)
can be implemented. The solution of non-linear problems (using the FAS
multigrid version) and systems of equations can also be vectorized in a
similar fashion. More difficult, but potentially important, is the
extension to general domains, which will require a lot of thought about
data structures and data motion. As a last comment, it will be noted that
all the timings quoted here were achieved using 64-bit arithmetic. On the
CDC CYBER 205 one can use 32-bit arithmetic as well and thus double the
result rate for vector operations while halving the memory requirements.
For the purpose of obtaining algebraic errors smaller than truncation
errors in solving second order equations, the 32-bit arithmetic is indeed
enough. We intend to examine this option.


ABSTRACT


Ray tracing is a widely used method for producing realistic computer-generated images. 
Ray tracing involves firing an imaginary ray from a view point, through a point on an image 
plane, into a three dimensional scene. The intersection of the ray with the objects in the scene 
determines what is visible at that point on the image plane. This process must be repeated 
many times, once for each point (commonly called a pixel) in the image plane. A typical image 
contains more than a million pixels making this process computationally expensive. A tradi- 
tional ray tracing program processes one ray at a time. In such a serial approach, as much as 
ninety percent of the execution time is spent computing the intersection of a ray with the sur- 
faces in the scene. With the CYBER 205, many rays can be intersected with all the bodies in 
the scene with a single series of vector operations. Vectorization of this intersection process 
results in large decreases in computation time. 

The CADLAB’s interest in ray tracing stems from the need to produce realistic images of 
mechanical parts. A high quality image of a part during the design process can increase the 
productivity of the designer by helping him visualize the results of his work. To be useful in 
the design process, these images must be produced in a reasonable amount of time. This
discussion will explain how the ray tracing process was vectorized and give examples of the
images obtained.


GEOMETRIC MODELING AND MECHANICAL DESIGN


In mechanical design, there are two broad reasons for using the computer: (1) predict
behavior, and (2) visualize. Behavior that needs to be predicted includes every test that one
would normally perform if given a physical prototype of the design: weight, center of gravity,
strength, movement, clearances, etc. This is why a computer model of a part is often referred
to as a "virtual prototype." Visualization is, in effect, another form of behavior prediction. In
this case, knowing the actual appearance of a proposed design is a valuable aid in
conceptualizing.

In order to feed information into visualization and analysis routines, a geometric model of the
design must first be created. In the early days of computer aided engineering, a wireframe
database was used to model the part shape. This was deemed inadequate, because the wireframe
could only model a part's edges, not its solid volume.

One of the methods by which we model part shapes in the CADLAB is with a newer
technique called Solid Modeling. A solid modeling database has sufficient geometric information to
completely and unambiguously define the shape of a three dimensional object. One method of
building a solid model database is with a technique called Constructive Solid Geometry, or CSG.
A CSG geometric creation sequence is characterized by applying boolean operators (union,
difference, intersection) to groups of primitive shapes (boxes, cylinders, cones, etc). Complex
designs may be created in this manner, with the results being sufficient to drive visualization
and other analyses. The remainder of this report will discuss the use of the CYBER 205 to
produce image information in order to view an object constructed using CSG operations.


INTERSECTIONS OF RAYS WITH A PRIMITIVE


One nice side effect of using a CSG representation is that the resulting object can easily be 
displayed using ray tracing. Ray tracing involves firing an imaginary ray from a view point, 
through a point on an image plane, into a three dimensional scene. It is not mathematically 
feasible to determine the visible surface of an entire CSG object in a single computation. How- 
ever, it is fairly easy to determine the intersection of a ray with each of the individual primitives 
which make up a CSG object. Then, a little more calculation produces the point along that ray 
which is visible. If one ray is fired through every pixel in the image plane, an image of the 
object is obtained (see Figure 1). 



[Figure 1. Surviving label from the figure: "view point".]


The typical (serial) ray tracing program must:


• Intersect all primitives in the scene with one ray. 

• Traverse the CSG database to determine which primitive intersection is the visible surface 
for that ray. 

• Determine the surface intensity using the relationship between the surface normal,
the eye position, and the position(s) of the light source(s).

This is the visible surface algorithm. It is repeated at every picture element (pixel) in the image 
plane. 

The intersection of the ray with the primitives is by far the most time consuming part of the
visible surface algorithm. However, it is also the easiest part of the algorithm to vectorize.
Instead of just finding the intersection of one ray with a primitive, a queue of rays is built
(serially, as in a traditional ray tracing program). Then the intersections of each primitive with
every ray in the queue are found in a series of vector operations. Table 1 gives computation
times for 100,000 rays intersecting a sphere and a cylinder primitive. For the vector results in
this table, a queue length of 2000 rays was used.
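As an illustration of intersecting a whole queue of rays with one primitive, the sketch below treats the sphere case. It is not the CADLAB code: the function name, the data layout (lists of 3-tuples) and the assumption of unit-length direction vectors are ours, and ordinary Python lists stand in for vector registers.

```python
import math

def intersect_sphere(origins, directions, center, radius):
    """Nearest non-negative hit distance for each ray, or None for a miss.

    Each intermediate quantity is formed for the whole queue before moving
    on, mimicking a series of vector operations. Directions are assumed to
    be unit vectors.
    """
    cx, cy, cz = center
    oc = [(ox - cx, oy - cy, oz - cz) for ox, oy, oz in origins]
    b = [dx * x + dy * y + dz * z
         for (dx, dy, dz), (x, y, z) in zip(directions, oc)]
    c = [x * x + y * y + z * z - radius * radius for x, y, z in oc]
    disc = [bi * bi - ci for bi, ci in zip(b, c)]   # from t^2 + 2bt + c = 0

    hits = []
    for bi, di in zip(b, disc):
        if di < 0.0:
            hits.append(None)                       # ray misses the sphere
            continue
        root = math.sqrt(di)
        t = -bi - root if -bi - root >= 0.0 else -bi + root
        hits.append(t if t >= 0.0 else None)
    return hits

hits = intersect_sphere(
    origins=[(0.0, 0.0, -5.0), (0.0, 2.0, -5.0)],
    directions=[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
    center=(0.0, 0.0, 0.0),
    radius=1.0,
)
```

The first ray strikes the unit sphere head-on and the second passes above it, so only the first yields a hit distance.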

FINDING A RAY’S VISIBLE SURFACE 

The above timings are only for the lowest level in the visible surface algorithm. After all the
intersections are found, the CSG database must still be traversed to determine which primitive
intersection is the visible surface for that ray. This constrains the length of the ray queue, since
it implies that all the ray intersection information must be stored (after the intersection
calculation) and then retrieved (for the visible surface calculation). If the ray queue is too long,
the time spent page faulting will be enormous. For this reason, the ray queue in our application is
approximately 2000 rays. The visible surface algorithm has not yet been vectorized. However,
it is apparent that at least parts of this process are vectorizable.


TABLE 1

CPU Time (2)

  Primitive | Cyber 205 Scalar | Cyber 205 Vector | Cyber 720
  ----------+------------------+------------------+-----------
  sphere    |       0.944      |      0.0279      |   13.1
  cylinder  |       2.729      |      0.1614      |   51.48
  steiner   |      11.157      |      1.047       |  216.0

Speedup (3)

  Primitive | S(205 vector / 205 scalar) | S(205 vector / 720)
  ----------+----------------------------+---------------------
  sphere    |           33.81            |         469
  cylinder  |           16.91            |         318
  steiner   |           10.67            |         206

(2) CPU times are in seconds.
(3) Speedup S(P1 / P2) = (CPU time on P2) / (CPU time on P1).
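The speedups listed in Table 1 can be re-derived from the CPU times as a ratio of the slower time to the faster one. A small check (the dictionary layout and function name are ours):

```python
cpu_seconds = {
    # primitive: (CYBER 205 scalar, CYBER 205 vector, CYBER 720)
    "sphere": (0.944, 0.0279, 13.1),
    "cylinder": (2.729, 0.1614, 51.48),
    "steiner": (11.157, 1.047, 216.0),
}

def speedup(slow_time, fast_time):
    """How many times faster the second timing is than the first."""
    return slow_time / fast_time

vector_vs_scalar = {name: speedup(scalar, vector)
                    for name, (scalar, vector, _) in cpu_seconds.items()}
vector_vs_720 = {name: speedup(c720, vector)
                 for name, (_, vector, c720) in cpu_seconds.items()}
```

The recomputed ratios match the tabulated speedups to within the rounding of the published CPU times.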

SPECIAL EFFECTS 

One of the reasons ray tracing has been so widely accepted is that it can show very realistic 
image synthesis effects. Shadows are perhaps the easiest extension to the algorithms described 
above. To determine if a visible surface is in a shadow, one ray must be fired toward each light 
source from the visible surface. If this ray hits a solid object before it encounters the light 
source, the visible surface is in a shadow. Reflection can be shown by spawning another ray 
from each surface such that the angle of reflection equals the angle of incidence. Transparency 
and refraction can be modeled if a refraction ray is spawned after a hit on a solid, transparent 
object. What should be clear from these special effects is that the extra rays to be fired do not 
come in a predictable, vectorizable progression. However, after a serial section of code has 
determined that another ray must be fired, this ray can be placed in the queue and intersected 
using vector code when the queue is full. 
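The queueing idea, in which serially generated rays of any kind are batched and handed to the vector intersection code when the queue fills, might look like the following. The class and callback names are hypothetical, and a stand-in function replaces the real intersection routine.

```python
class RayQueue:
    """Collect rays serially; intersect them in one batch when the queue fills."""

    def __init__(self, capacity, intersect_batch):
        self.capacity = capacity
        self.intersect_batch = intersect_batch   # callable taking a list of rays
        self.rays = []
        self.results = []

    def push(self, ray):
        self.rays.append(ray)
        if len(self.rays) >= self.capacity:
            self.flush()

    def flush(self):
        if self.rays:
            self.results.extend(self.intersect_batch(self.rays))
            self.rays = []

batch_sizes = []

def fake_intersect(rays):
    """Stand-in for the vectorized intersection; records the batch length."""
    batch_sizes.append(len(rays))
    return [None] * len(rays)

queue = RayQueue(capacity=2000, intersect_batch=fake_intersect)
for k in range(4500):          # primary, shadow and reflection rays alike
    queue.push(("ray", k))
queue.flush()                  # process the partially filled queue at the end
```

Shadow, reflection and refraction rays need no special handling here: once the serial code decides a ray is required, it is pushed like any other.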

SURFACE PATCHES 


Surface patches are used in computer aided design to sculpt the surface of a part that would 
be difficult or impossible to model using conventional primitives such as cylinders and boxes. 
Hence, surface patches play an important role in the design process of parts such as air foils and 
car bodies. At the CADLAB we are currently investigating the uses of Steiner surfaces as a 
sculpting device. Ray tracing is then used to visualize the resulting sculpted surface. 

A Steiner surface is a bi-quadratic surface. This means that computing the intersection of a
ray with a Steiner surface requires the solving of a quartic equation. Approximately 05 percent
of the computation time for this intersection calculation involves the solving of the quartic
equation while the rest is attributed to the determination of the coefficients for the quartic equation.
The determination of the polynomial coefficients is a straightforward process and is easily
vectorized. Vectorizing the process by which a queue of rays may be intersected with a Steiner
surface requires the vectorization of the root solver used for solving the quartic. For our
application we are only interested in the positive real root closest to zero. Table 1 shows the
results of vectorizing the Steiner intersection process.

To determine the roots of the quartic polynomial the slope and curvature functions (i.e. the 
first and second derivatives) are examined to determine the intervals over which a possible solu- 
tion exists. Modified Regula Falsi is then used to determine the roots within these intervals. 
Once a root is found it is evaluated to see if the root is acceptable. 
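The text does not say which modification of Regula Falsi is used. One common variant, the Illinois scheme, halves the function value at an endpoint that has been retained twice in a row; the sketch below assumes that variant, and the function name and tolerances are ours.

```python
def modified_regula_falsi(f, a, b, tol=1e-12, max_iter=100):
    """Illinois-type false position on a bracket [a, b] with a sign change."""
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if fa * fb > 0.0:
        raise ValueError("root is not bracketed")
    side = 0
    c = a
    for _ in range(max_iter):
        c = (a * fb - b * fa) / (fb - fa)   # secant through the end points
        fc = f(c)
        if abs(fc) < tol or abs(b - a) < tol:
            break
        if fc * fb > 0.0:                   # root lies in [a, c]
            b, fb = c, fc
            if side == -1:
                fa *= 0.5                   # a retained twice: halve its value
            side = -1
        else:                               # root lies in [c, b]
            a, fa = c, fc
            if side == +1:
                fb *= 0.5
            side = +1
    return c

def quartic(x):
    return x ** 4 - 5.0 * x ** 2 + 4.0      # roots at -2, -1, 1, 2

root = modified_regula_falsi(quartic, 0.5, 1.5)
```

The bracket [0.5, 1.5] would in practice come from the examination of the slope and curvature functions described above; here it is simply supplied by hand.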

The vectorized version of the root solver finds the roots of a series of quartic polynomials, 
each polynomial corresponding to a ray in the ray queue. The roots for all the polynomials 
must be found before the process can complete. Unlike the scalar version, it is most likely that 
all four roots will have to be determined and evaluated as it is likely that at least one ray will 
not intersect the surface. This process is sped up by ensuring that a sign change does not occur 
before using the Falsi method to determine subsequent roots once an acceptable root has been 
found for a particular polynomial. Gather-scatters are then used to compress the vectors used 
during these iterative processes. Convergence occurs when all of the roots being found converge 
within the specified tolerance. 
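The way converged polynomials drop out of the working vectors can be sketched as follows. To keep the example short, plain bisection stands in for the Falsi step (a substitution made here, not the paper's method); the part being illustrated is the shrinking list of active indices, which plays the role of the gather-scatter compression. All names are ours.

```python
def horner(coeffs, x):
    """Evaluate a polynomial given its coefficients, highest power first."""
    value = 0.0
    for c in coeffs:
        value = value * x + c
    return value

def solve_batch(polys, brackets, tol=1e-10):
    """Refine one bracketed root per polynomial, all polynomials in lockstep."""
    lo = [a for a, _ in brackets]
    hi = [b for _, b in brackets]
    roots = [None] * len(polys)
    active = list(range(len(polys)))        # indices still iterating
    passes = 0
    while active:
        passes += 1
        # "Gather": form work vectors only for unconverged polynomials.
        mid = [(lo[i] + hi[i]) / 2 for i in active]
        f_lo = [horner(polys[i], lo[i]) for i in active]
        f_mid = [horner(polys[i], m) for i, m in zip(active, mid)]
        still_active = []
        for i, m, fl, fm in zip(active, mid, f_lo, f_mid):
            if fl * fm <= 0.0:
                hi[i] = m                   # sign change in the lower half
            else:
                lo[i] = m
            if hi[i] - lo[i] < tol:
                roots[i] = (lo[i] + hi[i]) / 2   # "scatter" the result back
            else:
                still_active.append(i)
        active = still_active               # compress the working set
    return roots, passes

polys = [[1.0, 0.0, -5.0, 0.0, 4.0],        # x^4 - 5x^2 + 4, root at 1
         [1.0, 0.0, 0.0, 0.0, -16.0]]       # x^4 - 16, root at 2
roots, passes = solve_batch(polys, [(0.4, 1.5), (1.0, 3.5)])
```

The loop ends only when every polynomial has converged, which mirrors the remark that all roots must be found before the process can complete.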

The quartic root solver can be used for a variety of applications. One extension to the ray 
tracing program will be the inclusion of tori and other elliptical surfaces as primitives. These 
primitives will also require solving a fourth order equation to determine the intersection of a ray 
with their surface. 





OTHER APPLICATIONS 


Another application of ray tracing at Purdue is radiant heat transfer analysis of finned
tubes [MAXW83].4 Rays are fired to determine the radiation shape factor of one or more finned
tubes. Unlike the visualization of a CSG object, maximum length vector operations may be
used since it is only of interest to know that the ray strikes the tube and not where on the
tube. The computational requirements of this application have been reduced from 600 seconds
on a CDC 6600 down to 3 seconds on the CYBER 205.

CONCLUSION 

Ray tracing is, in general, a parallel algorithm. This paper examined how the parallel algorithm 
can be modified for use on a vector computer. In design work, the speed with which results are 
available is often critical. Vectorization of ray tracing programs promises shorter execution 
times. This will benefit not only visualization, but also such diverse areas as heat transfer, mass 
properties analysis, and nuclear engineering. 


4 [MAXW83] Maxwell, G.M., "Mathematical Modelling of a Gas Fired Swimming Pool Water Heater," Ph.D.
Thesis, Purdue University, in preparation.


ABSTRACT:


Conventional finite-difference migration has relied on approximations to
the acoustic wave equation which allow energy to propagate only downwards.
Although generally reliable, such approaches usually do not yield an accurate
migration for geological structures with strong lateral velocity variations or
with steeply dipping reflectors. An earlier study by D. Kosloff and E. Baysal
("Migration with the Full Acoustic Wave Equation") examined an alternative approach
based on the full acoustic wave equation. The 2D, Fourier-type algorithm which
was developed was tested by Kosloff and Baysal against synthetic data and against
physical model data. The results indicated that such a scheme gives accurate
migration for complicated structures. This paper describes the development and
testing of a vectorized, 3D migration program for the CYBER 205 using the
Kosloff/Baysal method. The program can accept as many as 65,536 zero-offset
(stacked) traces. In order to efficiently process a data cube of such magnitude
(65 million data values), data motion aspects of the program employ the
CDC-supplied subroutine SLICE4, which provides high speed input/output, taking
advantage of the efficiency of the system-provided subroutines Q7BUFIN and Q7BUFOUT
and of the parallelism achievable by distributing data transfer over four
different input/output channels. The results obtained are consistent with those of
Kosloff and Baysal. Additional investigations, based upon the work reported in
this paper, are in progress.


I INTRODUCTION


1.1 


snm 


In an attempt to develop a migration technique that did not have 
the faults of conventional finite-difference migration techniques, 
Kosloff and Baysal introduced a migration technique based on the full 
acoustic wave equation [1]. Conventional finite-difference techniques 
use an approximation to the wave equation that allows energy to 
propagate only downwards. Although these techniques yield 
reliable migration in most cases, they usually do not yield an accurate 
migration for geological structures with strong lateral velocity 
variations or with steeply dipping reflectors. The results of the 
migration technique developed by Kosloff and Baysal showed that their 
technique can accurately migrate these complicated geological 
structures. Furthermore, they found that there was no need to invoke 
complicated schemes in an attempt to correct the deficiencies of 
one-way equations [2]. 
1.2 DESCRIPTION OF THE PRESENT STUDY 


Although the technique developed by Kosloff and Baysal provides an 
excellent migration algorithm, it still is a two-dimensional migration 
technique. The object of this research was to extend the 2D migration 
technique of Kosloff and Baysal into a 3D migration technique that 
would migrate a cube of 65,536 (or fewer) traces, each of length 1,024 
samples. This goal immediately imposed several problems that were much 
greater than extending the numerical methods of Kosloff and Baysal. Of 
these problems, execution time and data motion were the most 
significant. Although the 2D migration of Kosloff and Baysal was 
implemented on a Digital Equipment Corporation VAX-11/780 incorporating 
an FPS-100 array processor, with favorable processing time, it was 
observed that this hardware was much too small to handle 
the 3D technique in a reasonable amount of time. Consequently, for its 
high rate of computation, the CDC CYBER 205 located at Colorado State 
University (CSU) was chosen to be the target machine. In Chapters II, 
III and IV, the following aspects of the 3D migration technique are 
developed: (1) the numerical methods involved; (2) the major features 
of the program implementing the 3D migration technique; and (3) the 
results of numerical tests of the program. 




II THE KOSLOFF/BAYSAL FOURIER METHOD 


2.1 INTRODUCTION 

Conventional finite-difference migration has relied on 
approximations to the wave equation which allow energy to propagate 
only downwards. Although generally reliable, such equations usually do 
not give accurate migration for structures with strong lateral velocity 
variations or with steep dips. The migration technique presented here 
is a three-dimensional extension of a two-dimensional migration 
technique developed earlier by Kosloff and Baysal [3]. The migration 
technique presented here, referred to in this paper as the KBF 
migration technique (for Kosloff/Baysal Fourier type), is based on the 
full acoustic wave equation, (2.1). 





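$$\frac{\partial}{\partial x}\left(\frac{1}{\rho}\frac{\partial P}{\partial x}\right) + \frac{\partial}{\partial y}\left(\frac{1}{\rho}\frac{\partial P}{\partial y}\right) + \frac{\partial}{\partial z}\left(\frac{1}{\rho}\frac{\partial P}{\partial z}\right) = \frac{1}{\rho c^2}\frac{\partial^2 P}{\partial t^2} \tag{2.1}$$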
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2.2 INPUT 


It is assumed that input to the KBF program consists of a "cube" 
of zero-offset traces in (x,y,z=0,t) space. The KBF technique 
presented here is designed to handle Nx * Ny such traces corresponding 
to Nx * Ny uniformly spaced points in the x and the y directions. The 
implementation discussed is designed so that the following must be true: 

32 <= Nx <= 256 and Nx = 2^i for some integer i 
32 <= Ny <= 256 and Ny = 2^j for some integer j 

These restrictions were chosen so as to test program efficiency; 
they do not apply, in general, to the KBF scheme. 

For each (x, y) pair, there will be Nt sample points in time, t_m, 
m = 1, ..., Nt, at which values of pressure, P(x,y,z=0,t_m), are given. 
Nt must also be a power of two. 
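
The size restrictions above lend themselves to a simple check before any data is read. The following is a minimal sketch in Python (not part of the original Fortran program); the function names are hypothetical and chosen only for illustration.

```python
def is_power_of_two(n):
    """Return True when n is a positive integer power of two."""
    return n > 0 and (n & (n - 1)) == 0


def check_cube_dimensions(nx, ny, nt):
    """Raise ValueError unless (nx, ny, nt) meet the restrictions stated above."""
    for name, n in (("Nx", nx), ("Ny", ny)):
        if not (32 <= n <= 256 and is_power_of_two(n)):
            raise ValueError(f"{name} = {n} must be a power of two in [32, 256]")
    if not is_power_of_two(nt):
        raise ValueError(f"Nt = {nt} must be a power of two")


# The largest cube considered in this study passes the check silently.
check_cube_dimensions(256, 256, 1024)
```

A cube with Nx = 96, for example, would be rejected because 96 is not a power of two.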

In equation (2.1) it is assumed that the density, ρ, is constant 
and that the velocity function, c(x,y,z), will be provided by the 
user. For testing purposes, velocity is given by a Fortran function 
subprogram in the code presented in the Appendix. Other forms 
representing the velocities may be used to replace the supplied 
function. 
2.3 THE KOSLOFF/BAYSAL TECHNIQUE IN 3D 


OBJECT OF THE PROGRAM 

Given P(x, y, z=0, t) for t = 0, 1DT, 2DT, ..., TMAX 
obtain P(x, y, z, t=0) for z = 0, 1DZ, 2DZ, ..., ZMAX 

BASIC NUMERICAL METHOD 

Equation (2.1) is Fourier transformed with respect to time, 
assuming density, ρ, is constant. The second order transformed 
equations can then be reduced to a system of first order equations in 
the usual manner. If density is constant, then we can write the 
following series of equations: 


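$$P(x,y,z,t) = F^{-1}P(x,y,z,w)$$

$$\frac{\partial}{\partial t}P(x,y,z,t) = jw\,F^{-1}P(x,y,z,w), \qquad \frac{\partial^2}{\partial t^2}P(x,y,z,t) = -w^2\,F^{-1}P(x,y,z,w)$$

$$\nabla^2 F^{-1}P + \frac{\partial^2}{\partial z^2}F^{-1}P = -\frac{w^2}{c^2}F^{-1}P$$

$$\frac{\partial^2 P}{\partial z^2} + \nabla^2 P = -\frac{w^2}{c^2}P$$

$$\frac{\partial}{\partial z}\begin{bmatrix} P \\ \partial P/\partial z \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ -\frac{w^2}{c^2} - \nabla^2 & 0 \end{bmatrix}\begin{bmatrix} P \\ \partial P/\partial z \end{bmatrix} \tag{2.2}$$

where 

$$\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} \tag{2.3}$$

which is of the form 

$$\frac{\partial v}{\partial z} = f(z, v) \tag{2.4}$$

with v denoting the vector of P and ∂P/∂z. 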
The expression "transformed with respect to time" means that the 
functions P(x,y,z,t_m) are represented by Discrete Fourier 
Transforms: 

$$P(x,y,z,t_m) = \sum_{i=1}^{N_t} P(x,y,z,w_i)\, e^{j w_i t_m} \tag{2.6}$$

where 

$$t_m = \begin{cases} (m-1)\,DT & \text{for } m = 1, 2, \ldots, \frac{N_t}{2}+1 \\ (m-(N_t+1))\,DT & \text{for } m = \frac{N_t}{2}+2, \ldots, N_t \end{cases}$$

P(x,y,z,w_i) is given by the Inverse Discrete Fourier Transform: 

$$P(x,y,z,w_i) = \frac{1}{N_t}\sum_{m=1}^{N_t} P(x,y,z,t_m)\, e^{-j w_i t_m} \tag{2.7}$$

where 

$$w_i = \begin{cases} \dfrac{2\pi (i-1)}{DT\,N_t} & \text{for } i = 1, 2, \ldots, \frac{N_t}{2}+1 \\ \dfrac{2\pi (i-(N_t+1))}{DT\,N_t} & \text{for } i = \frac{N_t}{2}+2, \ldots, N_t \end{cases}$$
DT is the sampling interval in time; j = sqrt(-1). Equation (2.6) is then 
substituted in (2.1). This results in (2.2), which must be satisfied 
for each w_i, for i = 1, ..., Nt/2 + 1. 

Thus, the Nt partial differential equations which provide a 
discrete approximation to (2.1), involving unknown functions 
P(x,y,z,t_m), are replaced by Nt/2 + 1 partial differential equations 
involving unknown functions P(x,y,z,w_i). Note that in the transformed 
equations, dependence on time, t, has been eliminated. 
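
The wrapped index conventions for t_m and w_i in the transform pair (2.6)/(2.7) can be illustrated with a short Python sketch. It is a naive O(Nt^2) illustration rather than the FFT used in the program, it assumes the sign convention implied by the jw factor in the derivation above (positive exponent in (2.6)), and all function names are hypothetical.

```python
import cmath
import math


def time_samples(nt, dt):
    """t_m of (2.6): non-negative times first, then the wrapped negative ones."""
    return [(m - 1) * dt if m <= nt // 2 + 1 else (m - (nt + 1)) * dt
            for m in range(1, nt + 1)]


def angular_frequencies(nt, dt):
    """w_i of (2.7), using the same wrapped ordering."""
    scale = 2.0 * math.pi / (dt * nt)
    return [scale * (i - 1) if i <= nt // 2 + 1 else scale * (i - (nt + 1))
            for i in range(1, nt + 1)]


def to_frequency(trace, dt):
    """Equation (2.7): P(w_i) = (1/Nt) * sum over m of P(t_m) exp(-j w_i t_m)."""
    nt = len(trace)
    t, w = time_samples(nt, dt), angular_frequencies(nt, dt)
    return [sum(p * cmath.exp(-1j * wi * tm) for p, tm in zip(trace, t)) / nt
            for wi in w]


def to_time(spectrum, dt):
    """Equation (2.6): P(t_m) = sum over i of P(w_i) exp(+j w_i t_m)."""
    nt = len(spectrum)
    t, w = time_samples(nt, dt), angular_frequencies(nt, dt)
    return [sum(s * cmath.exp(1j * wi * tm) for s, wi in zip(spectrum, w))
            for tm in t]
```

Because t_1 = 0, every exponential in (2.6) equals 1 for m = 1, so simply summing the coefficients P(w_i) returns the sample at t = 0. This is the property the migration relies on when it accumulates the frequency slices.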

The "classical" 4th order Runge-Kutta algorithm is applied to integrate 
equation (2.2) numerically in z. The (vector) computational equations 
are summarized below: 

K1 = Dz * f(z, v_old) 

K2 = Dz * f(z + Dz/2, v_old + K1/2) 

K3 = Dz * f(z + Dz/2, v_old + K2/2) 

K4 = Dz * f(z + Dz, v_old + K3) 

v_new = v_old + (K1 + 2K2 + 2K3 + K4) / 6 
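
The Runge-Kutta equations above can be written out as a short Python sketch. It operates on plain lists instead of CYBER 205 vectors, and the test right-hand side is an assumption made for illustration: a single laterally uniform mode, for which the horizontal Laplacian vanishes and (2.2) reduces to d2P/dz2 = -kz^2 P with kz = w/c.

```python
def rk4_step(f, z, v, dz):
    """Advance the vector v from z to z + dz with classical 4th order Runge-Kutta."""
    def axpy(a, x, y):
        # Element-wise y + a * x for plain Python lists.
        return [yi + a * xi for xi, yi in zip(x, y)]

    k1 = [dz * fi for fi in f(z, v)]
    k2 = [dz * fi for fi in f(z + dz / 2, axpy(0.5, k1, v))]
    k3 = [dz * fi for fi in f(z + dz / 2, axpy(0.5, k2, v))]
    k4 = [dz * fi for fi in f(z + dz, axpy(1.0, k3, v))]
    return [vi + (a + 2 * b + 2 * c + d) / 6
            for vi, a, b, c, d in zip(v, k1, k2, k3, k4)]


def plane_wave_rhs(kz):
    """Right-hand side f(z, v) for one uniform mode, with v = [P, dP/dz]."""
    return lambda z, v: [v[1], -kz * kz * v[0]]
```

Starting from v = [1, 0] and stepping to z = 1 with kz = 2 reproduces cos(2) closely, which gives a quick check that the stage weights are correct.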


With an appropriate approximation to 

$$\nabla^2 P = \frac{\partial^2 P}{\partial x^2} + \frac{\partial^2 P}{\partial y^2}$$
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2.4 KBF DESIGN OUTLINE 


The program has four main subdivisions, whose tasks are summarized 
below: 


Part I : For each pair of (x,y) values, the corresponding 
zero-offset trace of P(x,y,0,t) values is converted to another "trace" 
of P(x,y,0,w) values by application of the discrete Fourier transform 
(2.7) . 


Part II : For each w_i value (i = 1, 2, ..., Nt) the P(x,y,0,w_i) values 
are re-ordered into w_i-slices organized either sequentially in y for 
each x, or sequentially in x for each y, as appropriate for further 
transformations. 


Part III : Each w_i-slice, from the transformed input cube of 
P(x,y,0,w_i) values (see Figure 2.1), is developed into an (x,y,z,w_i) 
cube of P(x,y,z,w_i) values. This development is performed by 
integrating equation (2.2) numerically. The resulting P(x,y,z,w_i) 
values are accumulated for all w_i for each (x,y,z) combination. Since 
all the related exponential multipliers e^(j w_i t_1) equal 1 in magnitude 
(see equation (2.6)), this results in the generation of P(x,y,z,t=0) 
values, as required. (Note: t_1 = 0) 
There are two sub-problems of Part III: 


Part III.1 : Initial values for ∂P/∂z are obtained by the application 
of a two-dimensional Fourier transform to P followed by multiplication 
by SQRT[-1 * (w_i^2/c^2 - K_x^2 - K_y^2)]. Evanescent energy components are then 
eliminated and ∂P/∂z is obtained by the application of a 2-dimensional 
inverse Fourier transform to the result. 
Part III.2 : P(x,y,z,w) and ∂P/∂z(x,y,z,w) are propagated from z to z + Dz 
using the Runge-Kutta 4th order method to integrate equation (2.2) 
numerically. To do this, ∇²P must be approximated four times for each Dz. 
This is achieved by the use of a two-dimensional Fourier transform, followed 
by multiplication by -(K_x^2 + K_y^2). Evanescent energy is eliminated from P by 
applying a two-dimensional Fourier transform to P, obtaining P~. For all 
(K_x, K_y) pairs such that K_x^2 + K_y^2 > (w_i/c(x,y,z))^2, P~ is set to zero. Then a 
two-dimensional inverse Fourier transform is applied to yield P', which 
is input to the next step of numerical integration. Evanescent energy 
is also removed from ∂P/∂z in the same manner. 

Part IV : For each (x,y) , the P(x,y,z,t=0) values in Part III are 
retrieved so as to be contiguous in Z. These space traces are each 
Fourier transformed and the downgoing energy is eliminated by filtering 
out components with negative wave numbers K_z. The resulting filtered 
traces are inverse Fourier transformed, retaining only the real part of 
the result, which is the desired 3D depth migration. 
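
The two Fourier-domain operations used in Part III.2 (a derivative obtained by multiplying by minus the squared wavenumber, and removal of evanescent components) can be sketched in one dimension. The Python below is only an illustration under simplifying assumptions: it uses a naive discrete transform instead of a vectorized FFT, works along x only, and takes the velocity c as constant. All function names are hypothetical.

```python
import cmath
import math


def dft(values, sign):
    """Naive O(N^2) discrete Fourier transform; sign=-1 forward, +1 inverse."""
    n = len(values)
    out = [sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
               for m, v in enumerate(values)) for k in range(n)]
    return [o / n for o in out] if sign > 0 else out


def wavenumbers(n, dx):
    """Wrapped wavenumbers K_x matching the DFT ordering (negative half last)."""
    return [2 * math.pi * (k if k <= n // 2 else k - n) / (n * dx)
            for k in range(n)]


def second_derivative(p, dx):
    """Approximate d2P/dx2: transform, multiply by -K_x**2, transform back."""
    kx = wavenumbers(len(p), dx)
    spectrum = dft(p, -1)
    return dft([-k * k * s for k, s in zip(kx, spectrum)], +1)


def remove_evanescent(p, dx, w, c):
    """Zero every component with K_x**2 > (w / c)**2 (constant c assumed)."""
    kx = wavenumbers(len(p), dx)
    spectrum = dft(p, -1)
    kept = [s if k * k <= (w / c) ** 2 else 0.0 for k, s in zip(kx, spectrum)]
    return dft(kept, +1)
```

For a sampled sine wave with wavenumber k0, `second_derivative` returns -k0^2 times the input to rounding error, and `remove_evanescent` leaves it untouched as long as k0 does not exceed w/c.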
III PROGRAM DESIGN FEATURES 


3.1 INTRODUCTION 

The speed and capacity of the computer available to an individual 
researcher impose certain restrictions on the types of problems that 
can be solved. The CYBER 205's vector features and high speed scalar 
processor provide a tool for solving problems in a matter of minutes 
that would take on the order of days on a conventional scalar machine 
(this speed increase depends, to a considerable extent, on the degree 
to which it is possible to "vectorize" the scalar code). Of the 
problems that can now be solved using the CYBER 205, the migration 
application presented here makes extensive use of the CYBER 205's 
vector facilities. This chapter contains an overview of vector 
processing on the CYBER 205 and an in-depth discussion of the data-flow 
required by the KBF migration algorithm. 
3.2 CONCEPTS OF VECTOR PROCESSING 


This section deals primarily with the concept of vector machines; 
however, it is not within the scope of this paper to bring the novice 
up-to-date on vector computing. Several texts and papers have been 
written to perform that task. Hockney and Jesshope [4] present a 
comprehensive text covering vector and parallel processors as well as 
vector and parallel algorithms. Section 2.3 of Hockney and Jesshope 
[5] is dedicated to the CDC CYBER 205. For more information on the 
CYBER 205, see also Kascic [6]. 

THE CDC CYBER 205, HISTORY 

The CYBER 205, announced in 1980, replaced its predecessor, the 
CYBER 203. In turn, the CYBER 203, introduced in 1979, was a 
re-engineered version of the STAR 100. Conceived in 1964, the first 
STAR 100 became operational in 1973. The instruction set for the 
vector operations in the STAR 100 was based, primarily, on the APL 
language. The STAR 100 was designed to execute at a rate of 100 
Mega-flops (1 Mega-flop = one million floating point operations 
executed per second). 
THE CDC CYBER 205, DESIGN 

The CYBER 205 is a member of the family of "pipelined" machines. 
Pipeline refers to an assembly-line style of performing certain 
operations; thus more than one set of operands can be operated upon at 
a time. The vector processor of the CYBER 205 has what are known as 
vector pipes. These vector pipes are designed to stream contiguous 
data elements (vectors) through their pipelines. Presently, the CYBER 
205 can have as many as four vector pipes, all of which can operate 
concurrently. A four pipe CYBER 205, processing 32-bit words, can 
operate at a peak rate of 800 mega-flops. 

The various data types utilized by the CYBER Fortran 2.0 language 
include the following: 

Type               Comments 

Bit                the machine is bit addressable 
Half-word          32-bit floating point 
Full-word          64-bit floating point; 64-bit integer 
Double-precision   128-bit floating point 
Complex            two consecutive 64-bit words 
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VECTOR OPERATIONS AND CONSIDERATIONS 


Vectors on the CYBER 205 are "pointed to" by vector descriptors. 
A vector descriptor is a 64-bit entity with the following two fields: 
(1) Vector length, which consists of 16 bits and (2) Virtual address of 
the first vector element, which consists of the remaining 48 bits. 
Thus, a vector can have a length ranging from 0 to 65,535. Note that a 
bit vector can be no longer than 65,535 elements even though it 
consists of only 1024 64-bit memory words. 
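
The descriptor layout described above can be mimicked with integer bit operations. The Python sketch below is illustrative only: placing the length in the upper 16 bits is an assumption (the text gives the field widths but not their order), and the helper names are hypothetical.

```python
LENGTH_BITS = 16
ADDRESS_BITS = 48
MAX_LENGTH = (1 << LENGTH_BITS) - 1        # 65,535 elements
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def pack_descriptor(length, address):
    """Combine a vector length and a virtual address into one 64-bit integer."""
    if not 0 <= length <= MAX_LENGTH:
        raise ValueError("vector length must fit in 16 bits")
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("virtual address must fit in 48 bits")
    return (length << ADDRESS_BITS) | address


def unpack_descriptor(descriptor):
    """Split a 64-bit descriptor back into (length, address)."""
    return descriptor >> ADDRESS_BITS, descriptor & ADDRESS_MASK


def split_into_vectors(total, max_len=MAX_LENGTH):
    """Lengths of the successive vector operations needed to cover `total` elements."""
    return [min(max_len, total - start) for start in range(0, total, max_len)]
```

One practical consequence of the 16-bit length field: a 256 by 256 plane holds 65,536 elements, one more than a single vector can describe, so `split_into_vectors(65536)` yields two pieces.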

Vector operations come in a variety of forms on the CYBER 205, 
some of which are displayed in Table 3.1. 

Table 3.1. Vector Operation Examples. 

      DIMENSION A(100), B(100), C(100) 
      L = 100 

EXAMPLE 
NUMBER    VECTOR CODE                        EQUIVALENT SCALAR CODE 

(1)       A(1; L) = Q8VINTL(0, 1; L)            DO 10 I = 1, L 
                                             10 A(I) = I - 1 

(2)       B(1; L) = A(1; L) * 20.0              DO 20 I = 1, L 
                                             20 B(I) = A(I) * 20.0 

(3)       C(1; L) = A(1; L)*2.0+B(1; L)         DO 30 I = 1, L 
                                             30 C(I) = A(I)*2.0+B(I) 


The examples in Table 3.1 are rather simple but resemble many 
operations in scientific programs. Examples 1 and 2 show a vector 
function call and a vector-scalar operation. Example 3 shows a "linked 
triad" operation. A linked triad operation takes advantage of CYBER 
205 hardware which supports such operations. As one can see in Table 
3.2, the linked triad operations are quite efficient. An operation is 
generally considered a linked triad when it consists of two vector 
operands and one scalar operand. 

In certain situations, the results of some elements of a vector 
operation need not be saved. In this case, there is a mechanism for 
avoiding storage which involves a control vector. A control vector is 
a bit vector that specifies the storage of vector results. The control 
vector will be the same length as the result vector and where it has a 
value of one the corresponding result vector element will be saved and 
where it has a value of zero the corresponding result vector element 
will not be saved. The programmer also has the choice of reversing the 
meaning of the one's and zero's in the control vector. 
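
The effect of a control vector can be imitated in a few lines of Python. This is a sketch of the semantics only (it says nothing about how the hardware implements the mask), and `masked_store` is a hypothetical helper name.

```python
import math


def masked_store(result, computed, control, store_on=1):
    """Copy computed[i] into result[i] only where control[i] == store_on."""
    if not (len(result) == len(computed) == len(control)):
        raise ValueError("all three vectors must have the same length")
    return [new if bit == store_on else old
            for old, new, bit in zip(result, computed, control)]


a = [4.0, -1.0, 9.0, -16.0]
control = [1 if x >= 0.0 else 0 for x in a]          # bit vector built from a test
roots = masked_store(a, [math.sqrt(abs(x)) for x in a], control)

# Reversing the meaning of the bits stores only where the control bit is zero.
zeroed = masked_store(a, [0.0] * len(a), control, store_on=0)
```

Here the square root is computed for every element, as a vector operation would do, but only the elements flagged by the control vector are overwritten.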

A certain number of clock cycles are needed to set up the vector 
pipes. As this setup time is constant for a given operation, it is 
more efficient, in terms of total execution time, to reduce the number 
of vector operations by increasing the vector lengths whenever 
possible. Table 3.2 shows the set-up times, as well as the timings for 
the actual operations for various operations on the CYBER 205. 
Table 3.2. Vector Timing Information 

Vector Instruction        Number of        Number of 
                          Set-up Cycles    Operating Cycles 

Addition, Subtraction     51               N / 4 
Multiplication            52               N / 4 
Division, Square root     80               N / .61 
Linked triad              84               N / 4 

Where: 

N = Vector length 

1 Cycle = 20 nano-seconds 

The vector operations are on 32-bit words 
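
The figures in Table 3.2 can be turned into a rough cost model that shows why long vectors pay off. The Python sketch below assumes that total cycles are simply set-up cycles plus operating cycles, and it reads "N / .61" as N divided by 0.61; the function names are hypothetical.

```python
CYCLE_NS = 20.0                      # one clock cycle, from Table 3.2

# (set-up cycles, results produced per cycle) for 32-bit operands, from Table 3.2
TIMINGS = {
    "add": (51, 4.0),
    "multiply": (52, 4.0),
    "divide": (80, 0.61),
    "linked_triad": (84, 4.0),
}


def vector_time_ns(op, n):
    """Estimated time for one vector instruction of length n, in nanoseconds."""
    setup, per_cycle = TIMINGS[op]
    return (setup + n / per_cycle) * CYCLE_NS


def total_time_ns(op, total, vector_length):
    """Time to process `total` elements in vectors of at most `vector_length`."""
    full, rest = divmod(total, vector_length)
    elapsed = full * vector_time_ns(op, vector_length)
    return elapsed + (vector_time_ns(op, rest) if rest else 0.0)
```

Under this model, adding 65,536 elements in vectors of length 64 costs about 1.37 ms, while the same work in vectors of length 65,535 costs about 0.33 ms, because the set-up cost is paid 1,024 times in the first case and only twice in the second.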
3.3 A NOTE ON THE APPLICATION OF VECTOR PROCESSING TO THE KBF METHOD 


The KBF migration technique is such that almost all of the 
necessary operations can be vectorized. When working with a particular 
w-slice, all of the operations, including the two-dimensional FFT's, 
are vector operations. The computations performed at any given point 
of the omega-slice must be performed at all of the points. If there is 
a certain criterion that causes something different to occur at a given 
omega-slice point, a control vector can be created, dynamically, and 
the operation can still be performed in a vector manner. An example of 
this may be found in the routine CUTOFF where the evanescent energy is 
eliminated. In summary, there is no 
particular operation in the KBF migration scheme that cannot be 
treated as a vector operation. To emphasize this point, one should 
examine the technique presented in Chapter II and notice that there are 
no tricky operations that would prevent vectorization. In particular, 
it is important to note that there are no operations that have the 
following structure: 

      DO 100 I = 1, N 
         X(I) = F(Y(I)) 
         IF (X(I) .LT. VAL) GO TO 200 
  100 CONTINUE 
  200 CONTINUE 


The above code can not be efficiently vectorized because of the 
inherently sequential nature of the computations. 
3.4 DATA CONSIDERATIONS 


As previously discussed, a program implementing the KBF migration 
technique, extended into three dimensions, is easily expressed in terms 
of vector operations. The program developed here contains very few 
scalar operations, many of which are operations needed in order to 
control various vector instructions or vector subroutine calls. Having 
such a match of software to hardware, one might conclude that there are 
no remaining barriers to running the program. There are, however, a 
few major items that one tends to overlook, being overwhelmed by the 
computational power of the CYBER 205. The greatest of these is the 
data motion required to keep the CYBER 205 vector pipes busy. 

One penalty for the use of vector operations is that the data must 
be contiguous in memory for greatest efficiency (let alone for some 
vector operations to run at all). Furthermore, the vectors must reside 
in main memory as much as possible in order to prevent sure death from 
thrashing. With this in mind, one must realize that the memory 
requirement for the vectors that are necessary to perform a single step 
of the integration of one omega slice is quite large. For example, a 
(256 by 256) complex XY plane will require eleven vectors of length 
131,072 half-words. These, along with various support vectors, 
comprise 12 large pages (1 large page = 65,536 full-words). This is 
slightly less than half of the memory available to a user on a 
2-megaword 205; however, it is about all one can expect to get for any 
reasonable period in a time-sharing environment. But this is really 
just the tip of the iceberg - these are just the work arrays. The 
total data set consists of the input data cube, the work arrays, and 
the output data cube. 

Continuing with the previous example, the input cube could very 
well be of size 256*256*1024 half-words and the output cube could be as 
much as twice the size of the input cube (the size of the output cube 
depends upon the number of Z steps in the migration). This would be a 
total of 201,326,592 half-words, which is equivalent to 1536 large 
pages. Obviously, this is much more data than any CYBER 205 can have 
in memory at any given time. Consequently, the question of how to 
handle the data-flow arises. A solution that one may consider is to 
declare the data cubes to be huge arrays and to let the virtual memory 
mechanism handle the data cubes. 

To consider declaring the two data cubes as arrays, one must 
realize that access to these two arrays would have to be in a 
contiguous manner. Otherwise severe thrashing would result. In the 
case of the KBF migration algorithm, access to the data cubes must be 
done in several ways that would break the rule of contiguous access. 
Thus, it would be wise to check into at least one alternate method of 
handling these data cubes as large arrays. 
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Before presenting the data motion method used in this study, the 
need for efficiency must be established. Continuing with the previous 
example and without discussing the code in detail, the subroutine RHS3 
takes on the order of 100 milli-seconds to run, each time it is called. 
In this example, RHS3 would be called on the order of 4*512*512 
(1,048,576) times. The time needed for all of these calls is 
approximately 29 hours. Thus, any time for performing the data-motion 
is added onto the 29 hours. Therefore, one needs to find a mechanism 
to perform the data-motion without making the program run for an 
unacceptable amount of time. 
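
The storage and run-time figures quoted in this section can be reproduced with a few lines of arithmetic. The Python below only restates numbers given in the text; the variable names are chosen for illustration.

```python
HALF_WORDS_PER_LARGE_PAGE = 2 * 65_536      # 1 large page = 65,536 full-words

nx, ny, nt = 256, 256, 1024
input_cube = nx * ny * nt                   # half-words in the input cube
output_cube = 2 * input_cube                # upper bound quoted in the text
total_half_words = input_cube + output_cube
large_pages = total_half_words // HALF_WORDS_PER_LARGE_PAGE

work_vectors = 11 * 131_072                 # eleven work vectors per XY plane
work_pages = work_vectors / HALF_WORDS_PER_LARGE_PAGE

rhs3_calls = 4 * 512 * 512                  # call count quoted in the text
rhs3_hours = rhs3_calls * 0.100 / 3600.0    # at roughly 100 ms per call
```

The eleven work vectors alone fill 11 large pages, which is consistent with the total of 12 once the support vectors are included, and the RHS3 estimate comes to a little over 29 hours.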
3.5 A FOUR-WAY PARALLEL DATA MOTION TECHNIQUE 


CYBER 205 Fortran provides several routines that may be used to 
implement I/O that runs concurrently with other instructions being 
executed as well as with other I/O. These routines include Q7BUFIN, 
Q7BUFOUT, and Q7WAIT. For detailed information on these routines, see 
the CDC CYBER 200 FORTRAN VERSION 2 manual [7]. A typical use for 
these routines would be as follows: 


CALL Q7BUFOUT( ) 

CALL WORK ( ) 


In this example the programmer wishes to write information 
out to a unit and have the routine WORK run concurrently with the I/O. 
In general, as long as WORK does not use the I/O unit referred to in 
the Q7BUFOUT call, it can do anything it wishes. Thus, there is CPU 
activity concurrent to I/O activity. 

Another example where two I/O requests cause concurrent I/O, is as 
follows: 


CALL Q7BUFIN( ) 

CALL Q7BUFOUT( ) 

According to the CDC CYBER FORTRAN 2 manual [8], these calls are 
legal, so long as they do not access the same data block on the same 
disk. Also, two Q7BUFIN calls, two Q7BUFOUT calls, or a Q7BUFIN and a 
Q7BUFOUT call can be active at one time for a given unit. 
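
The pattern of starting an output request, computing, and then waiting can be imitated on a modern system with a background thread. The Python sketch below is only an analogy: it does not call or reproduce the Q7 routines, and `write_block` and `work` are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def write_block(store, key, data):
    """Stand-in for a buffered output request (plays the role of Q7BUFOUT)."""
    time.sleep(0.05)                 # pretend the device is slow
    store[key] = list(data)


def work(values):
    """Stand-in for computation that must not touch the block being written."""
    return sum(v * v for v in values)


disk = {}
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(write_block, disk, "slice-0", [1.0, 2.0, 3.0])
    energy = work(range(1000))       # runs while the write is in flight
    pending.result()                 # wait for completion (plays the role of Q7WAIT)
```

As in the Fortran usage, the computation is safe only because it does not touch the data being transferred.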

It should be obvious that these "Q7" calls are the basis of a 
solution to the problem of data-flow that was presented in the previous 
section. Indeed, they are; yet they are only the basis of the method 
used in this study. Dr. Bjorn Mossberg [9], of Control Data 
Corporation, wrote a utility known as SLICE4. Mossberg used the "Q7" 
utilities; however, the scheme he developed is much more elaborate 
than a series of Q7 calls to a particular I/O unit. 


SLICE4 

It is not within the scope of this paper to duplicate Mossberg's 
documentation of SLICE4. However, the concept and the terminology of 
SLICE4 will be presented as it applies to this study. For efficient 
operation, SLICE4 must be tightly integrated into the master program. 
Therefore, its terminology affects the view that one takes of the 
master program. 

In this study, two implementations of SLICE4 were needed and used; 
one for the input data cube and one for the output data cube. To 
explain the use of SLICE4, only the input data cube will be treated. 
The output data cube is handled in a similar manner. 
SLICE4 TERMINOLOGY 

The first step in using SLICE4 is to impose a coordinate system 
upon the data cube such that the cube is N1 by N2 by N3 elements in 
size, where N1 is the number of elements in what one normally considers 
the Z direction, N2 is the number of elements in the X direction, and 
N3 is the number of elements in the Y direction. The next step is to 
define a second coordinate system on the data cube. Instead of being 
coordinates of individual data items, this second coordinate system 
gives coordinates of "super-blocks." Super-blocks are small cubes of 
the original data set. The super-block coordinate system has NS1 
super-blocks in the 1-direction, NS2 in the 2-direction, and NS3 in the 
3-direction, where NS1 and NS2 must be multiples of four. NS3 does not 
have this restriction; however, for greatest efficiency, it should be 
one or a multiple of four. The reason for the multiple of four rule is 
that the super-blocks will reside on four different I/O units. No 
matter which direction the cube is accessed, each I/O unit will have 
one quarter of the super-blocks accessed. This is not the case when 
only a partial row or column of super-blocks is accessed; thus, it is 
most efficient to access a complete row or column. If it should happen 
that more than one I/O unit be controlled by a given controller, then 
SLICE4 will still execute, but in a less efficient manner (i.e. the 
parallelism is partially inhibited). Thus, one may access any four 
adjacent super-blocks at a cost which is one fourth the cost of 
accessing the same data with conventional techniques. 
The super-blocks themselves have a coordinate structure imposed 
upon them. This coordinate structure is L1 by L2 by L3, where L1 is 
the number of elements from the data cube in the 1-direction; L2 and 
L3 are defined in the same manner for their individual directions. 

Summarizing the terminology presented so far, the original data 
cube is broken up into NS1 by NS2 by NS3 super-blocks. Each 
super-block has L1 by L2 by L3 data elements. Thus the following rules 
must apply: 

N1 = NS1 * L1 with NS1 = 4 * i, i >= 1 
N2 = NS2 * L2 with NS2 = 4 * j, j >= 1 
N3 = NS3 * L3 
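
The decomposition rules can be checked mechanically, and a small model shows why the multiple-of-four rule balances the load. In the Python sketch below, the placement rule `io_unit` (rotating super-blocks over the four units by the sum of their coordinates) is purely an assumption made for illustration; the actual placement used by SLICE4 is described in Mossberg's documentation and is not reproduced here.

```python
from collections import Counter


def super_block_grid(n1, n2, n3, l1, l2, l3):
    """Return (NS1, NS2, NS3) after checking the divisibility rules."""
    dims = []
    for n, l in ((n1, l1), (n2, l2), (n3, l3)):
        if n % l:
            raise ValueError("each cube dimension must be a multiple of its L")
        dims.append(n // l)
    ns1, ns2, ns3 = dims
    if ns1 % 4 or ns2 % 4:
        raise ValueError("NS1 and NS2 must be multiples of four")
    return ns1, ns2, ns3


def io_unit(s1, s2, s3):
    """Hypothetical placement rule: rotate blocks over the four I/O units."""
    return (s1 + s2 + s3) % 4


def units_in_slice(direction, fixed, count):
    """Count blocks per I/O unit along one full slice of `count` super-blocks."""
    a, b = fixed
    coords = {1: lambda s: (s, a, b), 2: lambda s: (a, s, b), 3: lambda s: (a, b, s)}
    return Counter(io_unit(*coords[direction](s)) for s in range(count))
```

With this rule, a complete slice whose length is a multiple of four always touches each of the four units equally often, whatever the direction of access; a partial slice generally does not.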


SUPER-BLOCK ACCESS 

The rows and columns of super-blocks are referred to as slices. A 
1-slice is some column of super-blocks in the 1-direction, a 2-slice is 
some row of super-blocks in the 2-direction, and a 3-slice is some row 
of super-blocks in the 3-direction. One may access all, or just some, 
of the super-blocks of a slice via SLICE4. However, in this study, 
only the most efficient access is performed - accessing all 
super-blocks of a given slice. As access can be by any given slice, 
SLICE4 must have the super-blocks all formatted in the same manner. 
Thus, when accessing a given slice, the slice is written into a buffer 
by SLICE4 and the user must re-format the data from the buffer into a 
work array in the format that corresponds to the direction of access. 
DESIGN CONSIDERATIONS 

One needs to be careful to have enough array and buffer space to 
access the data cube in all the necessary directions. Thus, the size 
of the super-block comes into question. The larger the super-block, 
the fewer accesses to the data cube are needed and vice versa. In this 
study, the L1 dimension was set permanently to the value of 2. The 
reason for this is that, as one recalls from the migration technique, a 
complete XY plane is processed at any given time and there is only 
enough memory space to have two input planes in memory at the same 
time. 
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4.1 EXECUTION TESTS 

As discussed in section 3.4, it would take over 29 hours of 
execution time to migrate the maximum (assumed) data cube; thus for 
testing purposes, an input cube of size (64x64x64) was used. For both 
of the test runs discussed here, all of the traces consisted completely 
of zeros, except the center trace that had a single wavelet peaking at 
sample 16 (in time). The correctly migrated result, in this case, 
consists of a hemisphere. The first run (Figures 1 and 2) incorporated 
a padding in the time direction to delay the wrap-around effect 
inherent in Fourier algorithms. The second run (Figures 3 and 4) did 
not incorporate a padding - thus, wrap-around effects appeared. The 
first run took 240 CPU seconds and the second run took 115 CPU seconds. 

Test Run 1 : The migration of the input cube described above, 
using a constant velocity of 3000 m/s, a Dz interval of 6.0 meters, a 
Dx interval of 12.0 meters, a Dy interval of 12.0 meters, and a time 
interval of 4.0 milli-seconds, yields the results shown in Figures 1 
and 2. Figures 1 and 2 are slices of the output cube in the XZ and in 
the YZ directions, respectively, intersecting at the center of the 
output cube (Note the absence of the wrap-around effect) . 
Test Run 2 : The migration of the same input cube used in Test Run 
1 using the same sampling rates in all dimensions, but with a velocity 
interface (see Figure 3; V1 = 4000 m/s; V2 = 3000 m/s), yields the 
results displayed in Figures 3 and 4. Note the wrap-around effect 
present in these figures. 

4.2 FACTORS AFFECTING SPEED OF COMPUTATION 

Until a superior algorithm for performing the I/O required by the 
KBF migration algorithm appears, SLICE4 will remain the most efficient 
method available to perform the I/O task. However, should a CYBER 205 
ever be equipped with 8, or even 16, I/O channels, SLICE4 should easily 
be adapted to create SLICE8 and SLICE16 versions. Until then, there is 
little chance of decreasing the time required to perform the I/O. 

Other than I/O, the Runge-Kutta 4th order algorithm employed in 
the KBF migration technique is the most expensive feature. 
Consequently, use of a less costly method for numerical integration 
(e.g. , a multi-point method, using the Runge-Kutta method to get 
started) might result in increased computational efficiency. 

4.3 CONCLUSIONS 

The 3D KBF migration program presented in this thesis, implemented 
on the CYBER 205 supercomputer, yields results that are 
consistent with those of Kosloff and Baysal [10]. This was confirmed 
by Kosloff [11]. Thus, a 3D migration program, using the KBF migration 
technique (based on the full acoustic wave equation) and permitting lateral 
velocity variations, is now available for use on the CYBER 205. 
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Abstract: 

In petroleum engineering, the oil production profile of a reservoir can be 
simulated by using a finite gridded model. This profile is affected by the 
number and choice of wells, which in turn is a result of various production 
limits and constraints including, for example, the economic minimum well 
spacing, the number of drilling rigs available and the time required to drill 
and complete a well. After a well is available it may be shut-in because of 
excessive water or gas production. In order to optimize the field 
performance a penalty function algorithm was developed for scheduling wells. 
For an example with some 343 wells and 15 different constraints, the 
scheduling routine vectorized for the Cyber 205 averaged 560 times faster 
performance than the scalar version. 
Introduction: 


Mathematical modelling of the fluid production from a naturally occurring 
reservoir involves considering the reservoir as a network of interconnected 
blocks. To each grid block is associated a geologic description through 
properties, e.g., thickness, porosity, permeability, etc. Each grid block is 
considered to be in material balance with its surroundings, i.e., the amount 
of fluid in the block at time t + Δt is equal to the amount of fluid in that 
block at time t plus fluid influx in the time interval Δt minus fluid outflux 
in the time interval Δt. 

In Figure 1A, the reservoir is shown by a curved boundary. Overlaid 
areally is a rectangular grid. The sizes of the blocks can be chosen to 
represent the geological features of the reservoir as accurately as possible. 
Figure 1B shows a two dimensional cross-section of a reservoir and the grid 
used for its simulation. Notice that the reservoir contains water, oil and 
gas in various regions, and only some blocks are in communication with the 
wells by means of perforations in the well bore. To simulate the production 
profile, the material balance of the grid blocks in which wells are perforated 
must also take into account the fluid production or injection. In this manner 
one obtains pressures and saturations for each of the grid-blocks. For 
details on mathematical modelling of oil reservoirs please refer to a standard 
text, for example, references 1 and 2. 

Once a reservoir simulator is formulated, it can be used in many ways, 
e.g.: 

1. Assist in making economic decisions for field operation, e.g., the 
investments to date at Prudhoe Bay exceed $9 billion. 

2. Design of production strategy. The effect of changes in the number, 
location, spacing, or timing of wells can be studied. 

3. Prediction of reservoir performance. 

4. Matching of the production history. 
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When an oil field is developed, of course the most important objective is 
to maximize oil recovery. However, this objective is tempered by limitations, 
economic and physical, e.y., costs and capacities of various installations and 
devices. 

The dashed curve in Figure 2 represents oil production when all wells 
flow at their maximum capacity. The area under this curve represents 
cumulative oil production. The ratio of cumulative oil production to in-place 
oil represented as a fraction or percentage is called the Oil Recovery 
Factor. If facilities were constructed for this production profile, they 
would have to be constructed to handle oil production at the maximum rate, 
q_max. Economic considerations give us a target oil rate, q_T, less than q_max, 
at which oil production can be sustained for a period of time. The solid 
curve in Figure 2 represents this strategy. Note that sometimes this can be 
achieved without appreciable sacrifice in cumulative oil production. 
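
As a small worked illustration of the Oil Recovery Factor, the sketch below integrates a production-rate curve with the trapezoidal rule and divides by the oil in place. The sample times, rates and in-place volume are invented for illustration, and the function names are hypothetical.

```python
def cumulative_production(times, rates):
    """Area under the rate-versus-time curve (trapezoidal rule)."""
    total = 0.0
    for k in range(1, len(times)):
        total += 0.5 * (rates[k] + rates[k - 1]) * (times[k] - times[k - 1])
    return total


def recovery_factor(times, rates, oil_in_place):
    """Cumulative oil production as a fraction of the oil in place."""
    return cumulative_production(times, rates) / oil_in_place


# Illustrative profile only: a plateau followed by decline.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
rates = [100.0, 100.0, 80.0, 40.0, 0.0]
rf = recovery_factor(times, rates, oil_in_place=900.0)
print(rf)
```

Comparing the value obtained for the maximum-capacity profile with that for the plateau profile shows how much cumulative production, if any, is given up by producing at the target rate.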

Well Scheduling Problem: 

Once q_T is established, the problem of optimal scheduling, i.e., selecting for 
operation a given number of wells (say n) can be represented mathematically as 
follows: 

Maximize 

$$\sum_{i=1}^{n} q_i \le q_T$$ 

The maximum production rates of oil, gas and water are, however, limited to 
the capacity of the reservoir facilities. Thus, the field oil production is 
subject to constraints of the form: 

$$\sum_{i} x_i q_i \le L$$ 

where 

q_i is the oil production rate from well i, 

q_T is the target oil production rate for the field, 

x_i is either 1 or the gas-oil ratio or the water-oil ratio for well i, 
x_i q_i is then the oil, gas, or water production/injection rate, 
and L is the oil, gas, or water production or injection constraint. 

Some examples of these limits are: 

1. Fieldwide gas handling capacity, 

2. Water injection limit, 

3. Oil production limit at a station due to pipeline size, 

4. Gas-lift capacity available. 
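
A minimal sketch of how constraints of this form might be evaluated for a set of producing wells follows. The three constraint names, the multipliers and the limits are made up for illustration (arbitrary consistent units); only the form sum of x_i q_i compared against L comes from the text.

```python
def constraint_load(x, q):
    """Return the sum over wells of x_i * q_i."""
    return sum(xi * qi for xi, qi in zip(x, q))


# Oil rates q_i of three producing wells (illustrative values).
oil_rates = [1000.0, 800.0, 500.0]

# Multipliers x_i: 1 for an oil constraint, the gas-oil ratio for a gas
# constraint, the water-oil ratio for a water constraint.
multipliers = {
    "oil": [1.0, 1.0, 1.0],
    "gas": [0.5, 1.0, 2.0],
    "water": [0.1, 0.3, 0.2],
}
limits = {"oil": 2500.0, "gas": 2000.0, "water": 600.0}

# True where the constraint is satisfied, False where it is violated.
status = {name: constraint_load(multipliers[name], oil_rates) <= limits[name]
          for name in limits}
print(status)
```

In this example the oil and water limits are respected but the gas limit is exceeded, so this combination of wells would not be admissible.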

In order to select wells for production, each well can be assigned a 
priority. In the penalty function approach, priority assignment is made with 
a function which becomes large as a particular constraint approaches violation. 

Suppose (k-1) wells have been already chosen. 

For choosing the k-th well subject to a constraint of the form: 

$$\sum_{i} x_i q_i \le L,$$ 

a simple penalty function is: 

$$p(k) = \frac{\sum_{i=1}^{k-1} x_i q_i + x_k q_k}{L}$$ 

The penalty function p(k) has a value for each of the available wells, 
and arranges the set of available wells in order according to this particular 
constraint. 
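
The single-constraint penalty can be sketched as follows. The well labels, multipliers, rates and the limit are hypothetical; the sketch only evaluates the ratio defined above for each candidate and sorts the candidates by it.

```python
def penalty(chosen_load, x_k, q_k, limit):
    """Penalty for adding candidate well k under one constraint.

    chosen_load -- sum of x_i * q_i over the (k-1) wells already chosen
    x_k, q_k    -- multiplier and oil rate of the candidate well
    limit       -- the constraint limit L
    """
    return (chosen_load + x_k * q_k) / limit


def rank_wells(candidates, chosen_load, limit):
    """Order candidate wells from lowest to highest penalty."""
    return sorted(
        candidates,
        key=lambda w: penalty(chosen_load, candidates[w][0],
                              candidates[w][1], limit),
    )


# Illustrative data: well label -> (x, q) for a gas constraint.
candidates = {"A": (2.0, 100.0), "B": (0.5, 80.0), "C": (3.0, 120.0)}
order = rank_wells(candidates, chosen_load=100.0, limit=400.0)
print(order)
```

A penalty above 1 (well C here) means that adding the well would exceed the limit.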





When there are several (say m) constraints, penalty functions p_1(k), 
p_2(k), ..., p_m(k) can be obtained similarly. 

Since each constraint is individually fatal for well scheduling purposes, 
the violation of one constraint is as bad as any other. 

Hence, an overall penalty function can be of the form: 

$$p(k) = \max_{j = 1, \ldots, m} p_j(k)$$ 

Results and Discussion: 

The implementation of this scheme involves calculating, for each available 
well, m different values p_j(k) and then obtaining an overall penalty p(k) as 
the maximum of these m values. Thereafter the well with the lowest value of 
p(k) is selected. This procedure is repeated, selecting one well at a time, 
until the target rate q_T is achieved without violating any of the 
constraints. If the target rate cannot be achieved without violating one or 
more constraints, we are on the decline portion of the production curve. 

This scheme was programmed into a three dimensional, three phase (oil, 
gas, water) simulator. The simulator originally used a simple prioritization 
scheme based on gas-oil ratios. When a scalar version of the penalty function 
algorithm was introduced, the simulator ran appreciably slower. It was 
therefore decided to vectorize the penalty function algorithm. 

To calculate the penalty function in a case with n wells and m 
constraints, declare an array p(n, m). Usually n is much greater than m. 

For each of the m constraints vectorize the penalty calculation, e.g., 
for constraint i, store the values of p_i(k) in the elements of p(n, m), 
starting with p(1, i) and ending at p(n, i). 

Next, using a WHERE comparison statement, pick out the largest of the m 
values for each well. We now have the priority p(k) for each well. Use the 
Q8SMINI call to pick out the minimum value. If this value exceeds 1.0, no well 
can be chosen without violating a constraint. 
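
The vectorized step can be illustrated with NumPy array operations standing in for the CYBER 205 vector instructions. This is a sketch by analogy only: the row-wise maximum and `argmin` loosely play the roles of the WHERE comparison and the Q8SMINI call, and the function name, the availability mask and the data are assumptions made for the example.

```python
import numpy as np


def pick_next_well(q, x, limits, loads, available):
    """Return the index of the next well to select, or -1 if none fits.

    q         -- (n,) oil rates
    x         -- (n, m) multipliers, one column per constraint
    limits    -- (m,) constraint limits
    loads     -- (m,) sum of x_i * q_i over wells already chosen
    available -- (n,) boolean mask of wells not yet selected
    """
    # p[k, j]: penalty of well k under constraint j, all wells at once.
    p = (loads + x * q[:, None]) / limits
    # Largest of the m values for each well.
    overall = p.max(axis=1)
    # Exclude wells that are already selected.
    overall = np.where(available, overall, np.inf)
    k = int(np.argmin(overall))
    return k if overall[k] <= 1.0 else -1


# Illustrative data: four wells, a gas and a water constraint.
q = np.array([100.0, 80.0, 60.0, 40.0])
x = np.array([[2.0, 0.1], [0.5, 0.5], [1.0, 0.2], [3.0, 0.1]])
limits = np.array([200.0, 60.0])

first = pick_next_well(q, x, limits, np.zeros(2), np.ones(4, dtype=bool))
second = pick_next_well(q, x, limits, x[first] * q[first],
                        np.array([True, True, False, True]))
print(first, second)
```

Each call works on whole arrays of length n, so the cost of one selection does not depend on how many wells have already been chosen.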

TABLE 1. 

                                        Case 1        Case 2 

No. of wells                            119           343 
No. of constraints                      9             15 
Average well selection time (secs) 
    Scalar                              .14           1.6 
    Vector                              .001245       .00287 
Scalar:Vector ratio                     112           560 


A summary of results for two cases is presented in Table 1. For a 
reservoir with 119 wells and nine constraints, the vector algorithm was on 
average 112 times faster than the scalar version. For a larger example, Case 
2 in Table 1, with 343 wells and 15 constraints, the vector algorithm achieved 
even more spectacular results, an average acceleration factor of 560. 

The details of Case 1 are represented graphically in Figure 3. In the 
scalar algorithm, the time required for selection of wells increases 
monotonically for each subsequent selection. The selection of the first well 
required only .005 secs while the selection of the 65th well required .226 
secs. However, in the vector algorithm, each well selection required .001244 
secs, except for the first, which required .00155 secs. 

Similarly, for Case 2, the vector algorithm took .00287 secs for each 
well selection, except for the first well, for which it took .00447 secs. The 
scalar algorithm had a monotonic increase from .0185 secs for the first well, 
to 2.641 secs for the 220th well. This means that the selection of the 220th 
well was some 920 times faster in the vector algorithm as compared to the 
scalar version. 
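
The quoted ratios follow directly from the reported timings, as the short check below shows. The numbers are those given in Table 1 and in the text; the variable names are illustrative.

```python
def speedup(scalar_time, vector_time):
    """Ratio of scalar to vector time for the same well selection."""
    return scalar_time / vector_time


# Average selection times (secs) from Table 1.
case1_avg = speedup(0.14, 0.001245)
case2_avg = speedup(1.6, 0.00287)

# Selection of the 220th well in Case 2.
case2_well_220 = speedup(2.641, 0.00287)

print(case1_avg, case2_avg, case2_well_220)
```

The Case 1 average comes to about 112 and the 220th-well ratio to about 920, as stated. The Case 2 average computed from the tabulated times is about 557, which is consistent with the tabulated 560 if that figure was rounded.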



Conclusions: 


Clearly, as the number of wells and the number of constraints increase, 
the advantage of the vectorized version over the scalar version becomes 
greater. 

The reservoir simulator with the vectorized well selection scheme, 
including the more complicated penalty function scheme, ran faster than the 
original version with the simpler scalar well selection scheme. 

In short, judicious use of vectorization can make feasible highly 
desirable enhancements to large simulators. 

FIGURE 1A. 

RECTANGULAR GRID TO REPRESENT 
A RESERVOIR. EACH BLOCK MAY 
HAVE DIFFERENT THICKNESS AND 
POROSITY. 

FIGURE 1B. 

CROSS-SECTION OF A GRID WITH 
DIFFERENT TYPES OF WELLS. 